# <img style="float: left; padding-right: 100px; width: 300px" src="../image/logo.png">AI4SG Bootcamp:



## Introduction to Pandas for Data processing and Analysis


**Authors:** Faustine


---

To use pandas, first import the library, along with NumPy:

In [1]:
import numpy as np
import pandas as pd

## 1. Introduction to Pandas 

pandas is one of the most powerful and flexible Python libraries for data analysis and modelling. It contains *data structures* and *data manipulation* tools designed to make data cleaning and analysis fast and easy in Python. pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without `for` loops. While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data, whereas NumPy is best suited for working with homogeneous numerical array data.

### Pandas data structures
Pandas provides two fundamental data objects, for 1D (``Series``) and 2D data (``DataFrame``). 

A ``Series`` is a one-dimensional array-like object containing a sequence of values (of
similar types to NumPy types) and an associated array of data labels, called its index.
The simplest Series is formed from only an array of data:



In [2]:
s = pd.Series([0.1, 0.2, 0.3, 0.4])
s

0    0.1
1    0.2
2    0.3
3    0.4
dtype: float64

The Series also has an index, which by default is the integers *0* through *N - 1* (but it has no `.columns`):

In [3]:
s.index

RangeIndex(start=0, stop=4, step=1)

You can access the underlying numpy array representation with the `.values` attribute:

In [4]:
s.values

array([0.1, 0.2, 0.3, 0.4])
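
The index does not have to be the default integers. As a minimal sketch (the variable names are illustrative, and the three population figures are taken from the regions table used later in this notebook), a Series can be given explicit labels and then looked up by label:

```python
import pandas as pd

# Illustrative data: populations for three regions, labelled by region name.
population = pd.Series([706543, 416442, 213636],
                       index=['Mwanza', 'Arusha', 'Dodoma'])

# Look up a value by its label rather than by its position.
mwanza_population = population['Mwanza']

# The index now holds the labels instead of 0..N-1.
labels = list(population.index)
```

Here `population.index` is an `Index` of strings instead of a `RangeIndex`.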

A `DataFrame` is a **tabular data structure** (a two-dimensional object that holds labeled data) comprised of rows and columns, akin to a spreadsheet, database table, or R's data.frame object. You can think of it as multiple Series objects that share the same index.

For the examples here, we are going to create a small DataFrame with some data about a few regions. When creating a DataFrame manually, a common way to do this is from a dictionary of arrays or lists:

In [28]:
data = {'Region': ['Mwanza', 'Dar es salaam', 'Arusha', 'Mbeya', 'Dodoma'],
        'population': [706543, 4364541, 416442, 385279, 213636],
        'area': [9467, 1590, 37576, 35954, 2576]}
cities = pd.DataFrame(data)
cities

Unnamed: 0,Region,population,area
0,Mwanza,706543,9467
1,Dar es salaam,4364541,1590
2,Arusha,416442,37576
3,Mbeya,385279,35954
4,Dodoma,213636,2576


The DataFrame has a built-in concept of named rows and columns, the **`index`** and **`columns`** attributes:

In [6]:
cities.index

RangeIndex(start=0, stop=5, step=1)

In [7]:
cities.columns

Index(['Region', 'population', 'area'], dtype='object')

To check the data types of the different columns:

In [8]:
cities.dtypes

Region        object
population     int64
area           int64
dtype: object

A DataFrame also has a `values` attribute, but note that when the columns hold heterogeneous data, all values are upcast to a common type (here `object`):

In [9]:
cities.values

array([['Mwanza', 706543, 9467],
       ['Dar es salaam', 4364541, 1590],
       ['Arusha', 416442, 37576],
       ['Mbeya', 385279, 35954],
       ['Dodoma', 213636, 2576]], dtype=object)
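
To make the upcasting concrete, the sketch below rebuilds the same small table and compares the array from all columns with the array from the numeric columns only. The names `mixed` and `numeric` are illustrative.

```python
import pandas as pd

cities = pd.DataFrame({
    'Region': ['Mwanza', 'Dar es salaam', 'Arusha', 'Mbeya', 'Dodoma'],
    'population': [706543, 4364541, 416442, 385279, 213636],
    'area': [9467, 1590, 37576, 35954, 2576],
})

# Mixed string and integer columns: the array falls back to dtype=object.
mixed = cities.values

# Numeric columns only: the array keeps an integer dtype.
numeric = cities[['population', 'area']].values
```

Selecting only the numeric columns before calling `.values` is one way to avoid the slower `object` array.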

## 2. Pandas IO Tools  and basic operations

A wide range of [input/output formats](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-perf) are natively supported by pandas:

* CSV, text
* SQL database
* Excel
* HDF5
* json
* html
* pickle
* sas, stata
* Parquet
* ...


### Reading from a CSV file

In [39]:
data = pd.read_csv('data/primary.csv')

Exploration of the Series and DataFrame is essential (check out what you're dealing with). 

In [11]:
# Let's view the first few rows
data.head()

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE
0,ARUSHA,ARUSHA RURAL,32475.0,33698.0
1,ARUSHA,ARUSHA URBAN,36315.0,36993.0
2,ARUSHA,KARATU,23303.0,23181.0
3,ARUSHA,LONGIDO,10584.0,9045.0
4,ARUSHA,MERU,33854.0,34171.0


In [12]:
# View Last 6 rows
data.tail(6)

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE
157,TANGA,KOROGWE URBAN,6240.0,6057.0
158,TANGA,LUSHOTO,69395.0,70238.0
159,TANGA,MKINGA,12799.0,12568.0
160,TANGA,MUHEZA,19672.0,19650.0
161,TANGA,PANGANI,4903.0,4925.0
162,TANGA,TANGA URBAN,26555.0,26487.0


In [13]:
# List all the columns in the DataFrame
data.columns

Index(['REGION', 'DISTRICT', 'MALE', 'FEMALE'], dtype='object')

In [14]:
# We can use the len function again here to see how many rows there are in the dataframe: 163
len(data)

163

In [15]:
# How big is this dataframe (rows, columns)
data.shape

(163, 4)

Notice that `read_csv` automatically treated the first row in the file as a header row.
We can override this default behavior by customizing some of the arguments, such as `header`, `names` or `index_col`.
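
As a rough illustration of those arguments, the sketch below reads a tiny in-memory CSV that has no header row. The in-memory text is a stand-in for a real file (it is not `data/primary.csv`), and choosing `DISTRICT` as the index is only an assumption for the example.

```python
import io
import pandas as pd

# A tiny in-memory stand-in for a CSV file that has no header row.
raw = io.StringIO("ARUSHA,KARATU,23303,23181\nARUSHA,MERU,33854,34171\n")

# header=None: do not treat the first row as column names.
# names=...: supply the column names ourselves.
# index_col=...: use the DISTRICT column as the row index.
df = pd.read_csv(raw, header=None,
                 names=['REGION', 'DISTRICT', 'MALE', 'FEMALE'],
                 index_col='DISTRICT')
```

With these settings both lines of the input are kept as data rows, and `df.loc['MERU']` returns the row for that district.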

<div class="alert alert-success">
    <b>Activity 1</b>: Load the university enrollment data as a pandas DataFrame and inspect the first 5 rows. How many rows does the dataset contain?
</div>
 

### Adding and Dropping Columns
Let us add another column to the primary DataFrame. Suppose we want to add a total enrollment column:

In [16]:
data['TOTAL'] = data['MALE'] + data['FEMALE']

In [17]:
data.head()

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE,TOTAL
0,ARUSHA,ARUSHA RURAL,32475.0,33698.0,66173.0
1,ARUSHA,ARUSHA URBAN,36315.0,36993.0,73308.0
2,ARUSHA,KARATU,23303.0,23181.0,46484.0
3,ARUSHA,LONGIDO,10584.0,9045.0,19629.0
4,ARUSHA,MERU,33854.0,34171.0,68025.0


It is clear from the result above that we can perform arithmetic operations on the columns of a pandas DataFrame.

### Dropping Column

We can also delete a column from a pandas DataFrame. Let us delete the TOTAL column from the primary enrollment DataFrame.

In [18]:
data.drop('TOTAL', axis=1, inplace=True)
data.head()

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE
0,ARUSHA,ARUSHA RURAL,32475.0,33698.0
1,ARUSHA,ARUSHA URBAN,36315.0,36993.0
2,ARUSHA,KARATU,23303.0,23181.0
3,ARUSHA,LONGIDO,10584.0,9045.0
4,ARUSHA,MERU,33854.0,34171.0


#### Note:
 1. **axis=1** denotes that we are referring to a column, not a row
 2. **inplace=True** means that the DataFrame is modified directly instead of a new DataFrame being returned
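
For comparison, here is a small sketch of dropping a column without `inplace=True`. The two-row DataFrame is an illustrative stand-in that reuses the first two rows shown above, and `trimmed` is a name chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'MALE': [32475.0, 36315.0],
                   'FEMALE': [33698.0, 36993.0]})
df['TOTAL'] = df['MALE'] + df['FEMALE']

# Without inplace=True, drop returns a new DataFrame and leaves df unchanged.
# columns='TOTAL' is equivalent to passing 'TOTAL' together with axis=1.
trimmed = df.drop(columns='TOTAL')
```

After this runs, `df` still has the `TOTAL` column, while `trimmed` does not.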

<div class="alert alert-success">
    <b>Activity 2</b>: Add a Total enrollment column to the university DataFrame. Then drop the Total column you created, along with the gender columns, from the university dataset.
</div>
 

**Hint**: To delete multiple columns, use *dataframe.drop(['Column_name1', 'Column_name2'], axis=1)*.

### Data selection and slicing

Selecting a single column - returns a `Series`

In [19]:
data.REGION
# You can also use data['REGION']

0             ARUSHA
1             ARUSHA
2             ARUSHA
3             ARUSHA
4             ARUSHA
5             ARUSHA
6             ARUSHA
7      DAR ES SALAAM
8      DAR ES SALAAM
9      DAR ES SALAAM
10            DODOMA
11            DODOMA
12            DODOMA
13            DODOMA
14            DODOMA
15            DODOMA
16            DODOMA
17             GEITA
18             GEITA
19             GEITA
20             GEITA
21             GEITA
22             GEITA
23            IRINGA
24            IRINGA
25            IRINGA
26            IRINGA
27            KAGERA
28            KAGERA
29            KAGERA
           ...      
133        SHINYANGA
134           SIMIYU
135           SIMIYU
136           SIMIYU
137           SIMIYU
138           SIMIYU
139           SIMIYU
140          SINGIDA
141          SINGIDA
142          SINGIDA
143          SINGIDA
144          SINGIDA
145          SINGIDA
146           TABORA
147           TABORA
148           TABORA
149          

In [20]:
# To select column as data frame
data[['DISTRICT']]

Unnamed: 0,DISTRICT
0,ARUSHA RURAL
1,ARUSHA URBAN
2,KARATU
3,LONGIDO
4,MERU
5,MONDULI
6,NGORONGORO
7,ILALA
8,KINONDONI
9,TEMEKE


 Selecting multiple columns - returns a dataframe


In [21]:
data[['REGION','DISTRICT']]

Unnamed: 0,REGION,DISTRICT
0,ARUSHA,ARUSHA RURAL
1,ARUSHA,ARUSHA URBAN
2,ARUSHA,KARATU
3,ARUSHA,LONGIDO
4,ARUSHA,MERU
5,ARUSHA,MONDULI
6,ARUSHA,NGORONGORO
7,DAR ES SALAAM,ILALA
8,DAR ES SALAAM,KINONDONI
9,DAR ES SALAAM,TEMEKE


Selecting rows by number

In [22]:
data[15:20]

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE
15,DODOMA,KONGWA,27169.0,30264.0
16,DODOMA,MPWAPWA,32176.0,31646.0
17,GEITA,BUKOMBE,39490.0,40178.0
18,GEITA,CHATO,39377.0,38393.0
19,GEITA,GEITA,115698.0,116328.0


In [23]:
# Try  data[:8] and data[100:]
data[:8]

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE
0,ARUSHA,ARUSHA RURAL,32475.0,33698.0
1,ARUSHA,ARUSHA URBAN,36315.0,36993.0
2,ARUSHA,KARATU,23303.0,23181.0
3,ARUSHA,LONGIDO,10584.0,9045.0
4,ARUSHA,MERU,33854.0,34171.0
5,ARUSHA,MONDULI,13689.0,12888.0
6,ARUSHA,NGORONGORO,16017.0,12162.0
7,DAR ES SALAAM,ILALA,73129.0,78309.0


### Indexing
Pandas provides two indexers for selecting data:

* `.loc` for label-based indexing
* `.iloc` for positional indexing

So far we have worked with DataFrames that use the default *0, 1, 2, ... N* row labels. But we can also set one of the columns as the index.

Setting the index to the region names:

In [24]:
data = data.set_index('REGION')
data.head()

REGION,DISTRICT,MALE,FEMALE
ARUSHA,ARUSHA RURAL,32475.0,33698.0
ARUSHA,ARUSHA URBAN,36315.0,36993.0
ARUSHA,KARATU,23303.0,23181.0
ARUSHA,LONGIDO,10584.0,9045.0
ARUSHA,MERU,33854.0,34171.0


The reverse of this operation is `reset_index`:

In [25]:
data = data.reset_index('REGION')
data.head()

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE
0,ARUSHA,ARUSHA RURAL,32475.0,33698.0
1,ARUSHA,ARUSHA URBAN,36315.0,36993.0
2,ARUSHA,KARATU,23303.0,23181.0
3,ARUSHA,LONGIDO,10584.0,9045.0
4,ARUSHA,MERU,33854.0,34171.0


### Selecting data based on the index

<div class="alert alert-warning" style="font-size:120%">
<b>ATTENTION!</b>: <br><br>

One of pandas' basic features is the labeling of rows and columns, but this also makes indexing a bit more complex compared to NumPy. <br><br> We now have to distinguish between:

* selection by **label** (using the row and column names)
* selection by **position** (using integers)

</div>

When using `[]` like above, you can only select from one axis at once (rows or columns, not both). For more advanced indexing, you have some extra attributes:
    
* `loc`: selection by label
* `iloc`: selection by position

Both `loc` and `iloc` use the following pattern: `df.loc[ <selection of the rows> , <selection of the columns> ]`.

This 'selection of the rows / columns' can be: a single label, a list of labels, a slice or a boolean mask.
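
The sketch below shows two of these combinations on the small regions table. The table is rebuilt here with the region names as its index so that the block is self-contained, and `large` and `subset` are illustrative names.

```python
import pandas as pd

cities = pd.DataFrame({
    'population': [706543, 4364541, 416442, 385279, 213636],
    'area': [9467, 1590, 37576, 35954, 2576],
}, index=['Mwanza', 'Dar es salaam', 'Arusha', 'Mbeya', 'Dodoma'])

# loc: rows chosen by a boolean mask, columns by a list of labels.
large = cities.loc[cities['area'] > 30000, ['population']]

# iloc: rows chosen by a list of positions, columns by a slice of positions.
subset = cities.iloc[[0, 4], 0:1]
```

In both calls the part before the comma selects rows and the part after it selects columns.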

Selecting a single element

Consider the `cities` DataFrame, indexed by region name:

In [29]:
cities = cities.set_index('Region')

In [33]:
cities.loc['Mwanza', 'population']

706543

But the row or column indexer can also be a list, a slice, or a boolean array (see next section):

In [34]:
cities.loc['Mwanza':'Arusha', ['area', 'population']]

Region,area,population
Mwanza,9467,706543
Dar es salaam,1590,4364541
Arusha,37576,416442


<div class="alert alert-danger">
<b>NOTE</b>:

* Unlike slicing in numpy, the end label is **included**!

</div>

Selecting by position with `iloc` works similarly to **indexing NumPy arrays**:

In [35]:
cities.iloc[0:2,1:3]

Region,area
Mwanza,9467
Dar es salaam,1590
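
The difference between the two slicing conventions can be checked side by side. This is a minimal sketch on a rebuilt copy of the regions table; `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

cities = pd.DataFrame({
    'population': [706543, 4364541, 416442, 385279, 213636],
    'area': [9467, 1590, 37576, 35954, 2576],
}, index=['Mwanza', 'Dar es salaam', 'Arusha', 'Mbeya', 'Dodoma'])

# Label slice: the end label 'Arusha' is included, so three rows come back.
by_label = cities.loc['Mwanza':'Arusha']

# Position slice: the end position 2 is excluded, so two rows come back.
by_position = cities.iloc[0:2]
```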


The different indexing methods can also be used to **assign data**:

In [37]:
cities2 = cities.copy()
cities2.loc['Mwanza':'Arusha', 'population'] = 10000
cities2

Region,population,area
Mwanza,10000,9467
Dar es salaam,10000,1590
Arusha,10000,37576
Mbeya,385279,35954
Dodoma,213636,2576


To slice a specific column using label indexing

In [40]:
# And here is how to slice a column:
data.loc[: , "REGION"]

0             ARUSHA
1             ARUSHA
2             ARUSHA
3             ARUSHA
4             ARUSHA
5             ARUSHA
6             ARUSHA
7      DAR ES SALAAM
8      DAR ES SALAAM
9      DAR ES SALAAM
10            DODOMA
11            DODOMA
12            DODOMA
13            DODOMA
14            DODOMA
15            DODOMA
16            DODOMA
17             GEITA
18             GEITA
19             GEITA
20             GEITA
21             GEITA
22             GEITA
23            IRINGA
24            IRINGA
25            IRINGA
26            IRINGA
27            KAGERA
28            KAGERA
29            KAGERA
           ...      
133        SHINYANGA
134           SIMIYU
135           SIMIYU
136           SIMIYU
137           SIMIYU
138           SIMIYU
139           SIMIYU
140          SINGIDA
141          SINGIDA
142          SINGIDA
143          SINGIDA
144          SINGIDA
145          SINGIDA
146           TABORA
147           TABORA
148           TABORA
149          

We can also use position indexing:

In [41]:
data.iloc[:,0] 

0             ARUSHA
1             ARUSHA
2             ARUSHA
3             ARUSHA
4             ARUSHA
5             ARUSHA
6             ARUSHA
7      DAR ES SALAAM
8      DAR ES SALAAM
9      DAR ES SALAAM
10            DODOMA
11            DODOMA
12            DODOMA
13            DODOMA
14            DODOMA
15            DODOMA
16            DODOMA
17             GEITA
18             GEITA
19             GEITA
20             GEITA
21             GEITA
22             GEITA
23            IRINGA
24            IRINGA
25            IRINGA
26            IRINGA
27            KAGERA
28            KAGERA
29            KAGERA
           ...      
133        SHINYANGA
134           SIMIYU
135           SIMIYU
136           SIMIYU
137           SIMIYU
138           SIMIYU
139           SIMIYU
140          SINGIDA
141          SINGIDA
142          SINGIDA
143          SINGIDA
144          SINGIDA
145          SINGIDA
146           TABORA
147           TABORA
148           TABORA
149          

To extract only a row you would do the inverse:

In [42]:
data.iloc[2,:]

REGION      ARUSHA
DISTRICT    KARATU
MALE         23303
FEMALE       23181
Name: 2, dtype: object

To select a range of rows and columns:

In [43]:
# Select the first three rows (positions 0-2) and the first two columns (positions 0-1)
data.iloc[0:3,0:2]

Unnamed: 0,REGION,DISTRICT
0,ARUSHA,ARUSHA RURAL
1,ARUSHA,ARUSHA URBAN
2,ARUSHA,KARATU


To select only a specified range of columns:

In [44]:
data.iloc[:,1:3] 

Unnamed: 0,DISTRICT,MALE
0,ARUSHA RURAL,32475.0
1,ARUSHA URBAN,36315.0
2,KARATU,23303.0
3,LONGIDO,10584.0
4,MERU,33854.0
5,MONDULI,13689.0
6,NGORONGORO,16017.0
7,ILALA,73129.0
8,KINONDONI,91554.0
9,TEMEKE,85118.0


To select columns that are not adjacent, pass a list of positions:

In [45]:
data.iloc[:,[0, 3]]

Unnamed: 0,REGION,FEMALE
0,ARUSHA,33698.0
1,ARUSHA,36993.0
2,ARUSHA,23181.0
3,ARUSHA,9045.0
4,ARUSHA,34171.0
5,ARUSHA,12888.0
6,ARUSHA,12162.0
7,DAR ES SALAAM,78309.0
8,DAR ES SALAAM,93921.0
9,DAR ES SALAAM,88813.0


<div class="alert alert-success">
    <b>Activity 3</b>: What happens when you type the code below?
</div>

1. `university.loc[[0, 10, 50], :]`
2. `university.iloc[0:4, 1:4]`
3. `university[0:3]`
4. `university[:5]`
5. `university[-1:]`    
          

### Subsetting Data Using Criteria

Often, you want to select rows based on a certain condition. This can be done with 'boolean indexing' (like a WHERE clause in SQL), which works much as it does in NumPy.

The indexer (or boolean mask) should be 1-dimensional and the same length as the thing being indexed.

For example, we can select all rows where female enrollment is higher than 50,000.

In [46]:
data[data.FEMALE > 50000]

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE
7,DAR ES SALAAM,ILALA,73129.0,78309.0
8,DAR ES SALAAM,KINONDONI,91554.0,93921.0
9,DAR ES SALAAM,TEMEKE,85118.0,88813.0
14,DODOMA,KONDOA,48859.0,55869.0
19,GEITA,GEITA,115698.0,116328.0
30,KAGERA,KARAGWE,58254.0,61567.0
33,KAGERA,MULEBA,51631.0,51767.0
41,KIGOMA,KASULU,58799.0,60225.0
43,KIGOMA,KIGOMA RURAL,55053.0,55122.0
80,MBEYA,MBOZI,69952.0,72439.0


Or we can select all rows which are in Arusha

In [47]:
data[data.REGION == 'ARUSHA']

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE
0,ARUSHA,ARUSHA RURAL,32475.0,33698.0
1,ARUSHA,ARUSHA URBAN,36315.0,36993.0
2,ARUSHA,KARATU,23303.0,23181.0
3,ARUSHA,LONGIDO,10584.0,9045.0
4,ARUSHA,MERU,33854.0,34171.0
5,ARUSHA,MONDULI,13689.0,12888.0
6,ARUSHA,NGORONGORO,16017.0,12162.0


You can select data based on criteria and choose which columns to display. For example, let us select all districts with male enrollment below 10,000.

In [48]:
data[data.MALE < 10000][['DISTRICT']]

Unnamed: 0,DISTRICT
37,MPANDA URBAN
52,SIHA
55,LINDI URBAN
56,LIWALE
60,BABATI URBAN
93,MTWARA URBAN
111,KIBAHA RURAL
114,MAFIA
157,KOROGWE URBAN
161,PANGANI
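
The masks used above are ordinary boolean Series, so they can be inspected and combined. The following sketch uses a four-row stand-in built from rows shown earlier in this notebook; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'DISTRICT': ['KARATU', 'LONGIDO', 'ILALA', 'PANGANI'],
    'MALE': [23303.0, 10584.0, 73129.0, 4903.0],
    'FEMALE': [23181.0, 9045.0, 78309.0, 4925.0],
})

# A comparison produces a boolean Series with one entry per row.
mask = df['FEMALE'] > 50000

# Combine masks with & (and), | (or) and ~ (not). The parentheses are
# required because these operators bind more tightly than the comparisons.
small_or_large = df[(df['MALE'] < 10000) | (df['FEMALE'] > 50000)]
not_large = df[~mask]
```

Note that Python's `and`, `or` and `not` keywords do not work element-wise on Series, which is why the symbolic operators are used.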


### Sort Data in Pandas

We can also sort data in pandas. For example, let us sort the DataFrame's rows by male enrollment, in descending order.

In [49]:
data.sort_values(by='MALE', ascending=False)

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE
19,GEITA,GEITA,115698.0,116328.0
8,DAR ES SALAAM,KINONDONI,91554.0,93921.0
9,DAR ES SALAAM,TEMEKE,85118.0,88813.0
102,MWANZA,SENGEREMA,75842.0,75419.0
134,SIMIYU,BARIADI RURAL,75089.0,79153.0
127,SHINYANGA,KAHAMA DC,73794.0,75747.0
7,DAR ES SALAAM,ILALA,73129.0,78309.0
80,MBEYA,MBOZI,69952.0,72439.0
158,TANGA,LUSHOTO,69395.0,70238.0
41,KIGOMA,KASULU,58799.0,60225.0


Sorting by region descending

In [50]:
data.sort_values(by=['REGION'],ascending=False)

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE
162,TANGA,TANGA URBAN,26555.0,26487.0
157,TANGA,KOROGWE URBAN,6240.0,6057.0
154,TANGA,HANDENI,35554.0,35379.0
155,TANGA,KILINDI,39676.0,40330.0
156,TANGA,KOROGWE RURAL,25306.0,25000.0
153,TANGA,BUMBULI,,
158,TANGA,LUSHOTO,69395.0,70238.0
159,TANGA,MKINGA,12799.0,12568.0
160,TANGA,MUHEZA,19672.0,19650.0
161,TANGA,PANGANI,4903.0,4925.0
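
`sort_values` also accepts several sort keys at once. As a sketch on a four-row stand-in (rows taken from the tables above, `ordered` is an illustrative name), the rows are sorted by region ascending and, within each region, by male enrollment descending:

```python
import pandas as pd

df = pd.DataFrame({
    'REGION': ['TANGA', 'ARUSHA', 'TANGA', 'ARUSHA'],
    'DISTRICT': ['LUSHOTO', 'KARATU', 'PANGANI', 'MERU'],
    'MALE': [69395.0, 23303.0, 4903.0, 33854.0],
})

# One ascending flag per sort key: REGION ascending, MALE descending.
ordered = df.sort_values(by=['REGION', 'MALE'], ascending=[True, False])
```

Like the single-key calls above, this returns a new DataFrame and keeps the original row labels.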


**Putting it together**: Select the district and region of all rows with Male > 50000 and Female > 30000, sorted by district in descending order

In [51]:
temp_data = data[(data.MALE > 50000) & (data.FEMALE > 30000)]
temp_data2 = temp_data[['DISTRICT', 'REGION']]
temp_data2.sort_values(by="DISTRICT", ascending=False)

Unnamed: 0,DISTRICT,REGION
9,TEMEKE,DAR ES SALAAM
102,SENGEREMA,MWANZA
148,NZEGA,TABORA
33,MULEBA,KAGERA
80,MBOZI,MBEYA
99,MAGU,MWANZA
158,LUSHOTO,TANGA
8,KINONDONI,DAR ES SALAAM
85,KILOSA,MOROGORO
43,KIGOMA RURAL,KIGOMA


The above code is equivalent to:

In [52]:
data[(data.MALE > 50000) & (data.FEMALE > 30000)][['DISTRICT', 'REGION']].sort_values(by="DISTRICT", ascending=False)

Unnamed: 0,DISTRICT,REGION
9,TEMEKE,DAR ES SALAAM
102,SENGEREMA,MWANZA
148,NZEGA,TABORA
33,MULEBA,KAGERA
80,MBOZI,MBEYA
99,MAGU,MWANZA
158,LUSHOTO,TANGA
8,KINONDONI,DAR ES SALAAM
85,KILOSA,MOROGORO
43,KIGOMA RURAL,KIGOMA
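
One possible alternative, sketched here on a four-row stand-in rather than the full dataset, is to pass the row mask and the column list to a single `.loc` call. This is an option rather than the notebook's own approach, and `mask` and `result` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'REGION': ['DAR ES SALAAM', 'DAR ES SALAAM', 'ARUSHA', 'MBEYA'],
    'DISTRICT': ['KINONDONI', 'TEMEKE', 'KARATU', 'MBOZI'],
    'MALE': [91554.0, 85118.0, 23303.0, 69952.0],
    'FEMALE': [93921.0, 88813.0, 23181.0, 72439.0],
})

# Build the row filter once, then select rows and columns in one step.
mask = (df['MALE'] > 50000) & (df['FEMALE'] > 30000)
result = (df.loc[mask, ['DISTRICT', 'REGION']]
            .sort_values(by='DISTRICT', ascending=False))
```

Naming the mask keeps the condition readable when it grows beyond two comparisons.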


<div class="alert alert-success">
    <b>Activity 4</b>: Select all universities with Bachelor enrollment greater than 100, Masters enrollment greater than 50 and at least 10 PhD enrollments.
</div>

<div class="alert alert-info" style="font-size:120%">
<b>NOTE</b>: <br><br>

So as a summary, `[]` provides the following convenience shortcuts:

* **Series**: selecting a **label**: `s[label]`
* **DataFrame**: selecting a single or multiple **columns**:`df['col']` or `df[['col1', 'col2']]`
* **DataFrame**: slicing or filtering the **rows**: `df['row_label1':'row_label2']` or `df[mask]`

</div>

In [None]:
df = pd.read_csv('data/university.csv')
df.head()

## Basic operations on Series and DataFrames

Just like with NumPy arrays, many operations are element-wise, and NumPy functions can be applied directly to pandas objects. This includes ufuncs (element-wise functions) as well as reductions such as `np.max`:

In [53]:
np.max(data.FEMALE)

116328.0

In [54]:
data['FEMALE'].max()

116328.0
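
For an element-wise case, the sketch below applies a NumPy ufunc and a scalar division to a short Series. The three values are the first three `FEMALE` figures shown earlier; `log_female` and `in_thousands` are illustrative names.

```python
import numpy as np
import pandas as pd

female = pd.Series([33698.0, 36993.0, 23181.0], name='FEMALE')

# A ufunc is applied to every element and returns a Series with the same index.
log_female = np.log10(female)

# Arithmetic between a Series and a scalar is element-wise as well.
in_thousands = female / 1000
```

Unlike `np.max`, which reduces the Series to a single number, both results here have one value per input row.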

In [55]:
data["TOTAL"] = data[["FEMALE", "MALE"]].sum(axis=1)
data.head()

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE,TOTAL
0,ARUSHA,ARUSHA RURAL,32475.0,33698.0,66173.0
1,ARUSHA,ARUSHA URBAN,36315.0,36993.0,73308.0
2,ARUSHA,KARATU,23303.0,23181.0,46484.0
3,ARUSHA,LONGIDO,10584.0,9045.0,19629.0
4,ARUSHA,MERU,33854.0,34171.0,68025.0


### Function Application and Mapping


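When you have an operation which does NOT work element-wise, or you have no idea how to do it directly in pandas, use the **apply()** function. A typical use case is with a custom-written function or a **lambda** function. The example below standardizes each column by subtracting its mean and dividing by its standard deviation: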


In [56]:
f = lambda x: (x - x.mean())/x.std()

In [57]:
data[["FEMALE", "MALE", "TOTAL"]].apply(f)

Unnamed: 0,FEMALE,MALE,TOTAL
0,0.124343,0.092877,0.369988
1,0.294130,0.296550,0.539462
2,-0.417582,-0.393603,-0.097673
3,-1.145990,-1.068215,-0.735545
4,0.148716,0.166019,0.413978
5,-0.947966,-0.903527,-0.570513
6,-0.985376,-0.780050,-0.532462
7,2.423084,2.249153,2.395239
8,3.227548,3.226410,3.203700
9,2.964340,2.885046,2.929502
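
Two related patterns are sketched below on a three-row stand-in: `Series.map` for transforming a single column element by element, and `apply` with `axis=1` for a function that needs several columns of the same row. The new column names `REGION_TITLE` and `MAJORITY` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'REGION': ['ARUSHA', 'DAR ES SALAAM', 'DODOMA'],
    'MALE': [23303.0, 73129.0, 27169.0],
    'FEMALE': [23181.0, 78309.0, 30264.0],
})

# Series.map applies a function to each element of one column.
df['REGION_TITLE'] = df['REGION'].map(lambda name: name.title())

# apply with axis=1 passes each row (as a Series) to the function.
df['MAJORITY'] = df.apply(
    lambda row: 'FEMALE' if row['FEMALE'] > row['MALE'] else 'MALE',
    axis=1,
)
```

Row-wise `apply` is convenient but runs a Python function per row, so vectorized column operations are usually preferable when they exist.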


### Unique Values, Value Counts, and Membership

The `unique` method returns an array of the unique values in a Series. For example, suppose we want to know which regions appear in the dataset.

In [58]:
data['REGION'].unique()

array(['ARUSHA', 'DAR ES SALAAM', 'DODOMA', 'GEITA', 'IRINGA', 'KAGERA',
       'KATAVI', 'KIGOMA', 'KILIMANJARO', 'LINDI', 'MANYARA', 'MARA',
       'MBEYA', 'MOROGORO', 'MTWARA', 'MWANZA', 'NJOMBE', 'PWANI',
       'RUKWA', 'RUVUMA', 'SHINYANGA', 'SIMIYU', 'SINGIDA', 'TABORA',
       'TANGA'], dtype=object)

`value_counts` computes a Series containing value frequencies:

In [60]:
data['REGION'].value_counts()

TANGA            10
MBEYA            10
KAGERA            8
MARA              8
MWANZA            7
ARUSHA            7
MOROGORO          7
TABORA            7
SHINYANGA         7
KIGOMA            7
DODOMA            7
KILIMANJARO       7
MTWARA            7
PWANI             7
GEITA             6
MANYARA           6
SINGIDA           6
LINDI             6
RUVUMA            6
NJOMBE            6
SIMIYU            6
IRINGA            4
RUKWA             4
KATAVI            4
DAR ES SALAAM     3
Name: REGION, dtype: int64

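`value_counts` is also available as a top-level pandas function, `pd.value_counts`, that can be used with any array or sequence (recent pandas releases deprecate it in favor of the `Series.value_counts` method shown above):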

In [61]:
pd.value_counts(data['REGION'])

TANGA            10
MBEYA            10
KAGERA            8
MARA              8
MWANZA            7
ARUSHA            7
MOROGORO          7
TABORA            7
SHINYANGA         7
KIGOMA            7
DODOMA            7
KILIMANJARO       7
MTWARA            7
PWANI             7
GEITA             6
MANYARA           6
SINGIDA           6
LINDI             6
RUVUMA            6
NJOMBE            6
SIMIYU            6
IRINGA            4
RUKWA             4
KATAVI            4
DAR ES SALAAM     3
Name: REGION, dtype: int64

### `isin` and `string` methods

The `isin` method of Series is very useful to select rows that may contain certain values:

In [62]:
data[data['REGION'].isin(['MWANZA'])]

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE,TOTAL
97,MWANZA,ILEMELA,34923.0,36364.0,71287.0
98,MWANZA,KWIMBA,40755.0,43045.0,83800.0
99,MWANZA,MAGU,58405.0,59754.0,118159.0
100,MWANZA,MISUNGWI,33284.0,34264.0,67548.0
101,MWANZA,NYAMAGANA,36646.0,38017.0,74663.0
102,MWANZA,SENGEREMA,75842.0,75419.0,151261.0
103,MWANZA,UKEREWE,39612.0,40203.0,79815.0


Let's say we want to select all rows where the district name starts with 'B'. Python strings have a `startswith` method, and pandas exposes it for a whole column through the `.str` accessor:

In [63]:
data[data['DISTRICT'].str.startswith('B')]

Unnamed: 0,REGION,DISTRICT,MALE,FEMALE,TOTAL
10,DODOMA,BAHI,15672.0,15967.0,31639.0
17,GEITA,BUKOMBE,39490.0,40178.0,79668.0
27,KAGERA,BIHARAMULO,20620.0,20865.0,41485.0
28,KAGERA,BUKOBA RURAL,27577.0,28264.0,55841.0
29,KAGERA,BUKOBA URBAN,10377.0,10654.0,21031.0
39,KIGOMA,BUHIGWE,,,0.0
59,MANYARA,BABATI RURAL,29144.0,30074.0,59218.0
60,MANYARA,BABATI URBAN,7910.0,8081.0,15991.0
65,MARA,BUNDA,43433.0,42739.0,86172.0
66,MARA,BUTIAMA,,,0.0
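
Both ideas extend naturally: `isin` accepts several values, and the `.str` accessor offers other string tests such as `endswith`. A minimal sketch on a four-row stand-in (illustrative variable names):

```python
import pandas as pd

df = pd.DataFrame({
    'REGION': ['MWANZA', 'KAGERA', 'KAGERA', 'TANGA'],
    'DISTRICT': ['MAGU', 'BUKOBA RURAL', 'BUKOBA URBAN', 'KOROGWE URBAN'],
})

# Keep rows whose region is any of the listed values.
in_regions = df[df['REGION'].isin(['MWANZA', 'TANGA'])]

# Keep rows whose district name ends with 'URBAN'.
urban = df[df['DISTRICT'].str.endswith('URBAN')]
```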


## Data summarization and  Descriptive Statistics

Descriptive statistics can give you great insight into the shape of each attribute. The **describe()** function on the Pandas DataFrame lists 8 statistical properties of each attribute:

* Count
* Mean
* Standard Deviation
* Minimum Value
* 25th Percentile
* 50th Percentile (Median)
* 75th Percentile
* Maximum Value

For example, `cities.describe()` returns a statistical summary of the cities data.
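
A small sketch of that summary, rebuilding the regions table so the block is self-contained (`summary` is an illustrative name):

```python
import pandas as pd

cities = pd.DataFrame({
    'Region': ['Mwanza', 'Dar es salaam', 'Arusha', 'Mbeya', 'Dodoma'],
    'population': [706543, 4364541, 416442, 385279, 213636],
    'area': [9467, 1590, 37576, 35954, 2576],
})

# By default describe() summarizes only the numeric columns,
# so the text column 'Region' is left out of the result.
summary = cities.describe()

# Individual statistics can be read back with loc[statistic, column].
median_population = summary.loc['50%', 'population']
```

The rows of `summary` are the eight statistics listed above and its columns are `population` and `area`.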

Calling DataFrame’s sum method returns a Series containing column sums:

In [64]:
data[['FEMALE', 'MALE']].sum()

FEMALE    4160892.0
MALE      4086280.0
dtype: float64

Passing axis='columns' or axis=1 sums across the columns instead:

In [None]:
data[['FEMALE', 'MALE']].sum(axis=1)

NA values are skipped by default, so a row or column that is entirely NA sums to 0 (this is why the `TOTAL` of districts with missing data is `0.0` above). Skipping can be disabled with the `skipna` option:

In [None]:
data[['FEMALE', 'MALE']].sum(axis=1,skipna=False)
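
The effect of missing values on `sum` can be seen on a three-row stand-in in which one row is partly missing and one is entirely missing. The `min_count` argument is an additional option of `sum` that is not used elsewhere in this notebook; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'FEMALE': [33698.0, np.nan, np.nan],
                   'MALE': [32475.0, 10584.0, np.nan]})

# Default: NaN values are skipped, so an all-NaN row sums to 0.0.
default_sum = df.sum(axis=1)

# skipna=False: any NaN in a row makes that row's sum NaN.
strict_sum = df.sum(axis=1, skipna=False)

# min_count=1: require at least one non-NaN value, otherwise return NaN.
at_least_one = df.sum(axis=1, min_count=1)
```

`min_count=1` is a middle ground: partly missing rows are still summed, but a row with no data at all is reported as missing rather than as zero.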

Some methods, like idxmin and idxmax , return indirect statistics like the index value
where the minimum or maximum values are attained

In [None]:
data[['FEMALE', 'MALE']].idxmax()
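
Because `idxmax` and `idxmin` return index labels, their result can be passed straight to `loc` to look up the matching row. A sketch on a three-row stand-in (illustrative variable names):

```python
import pandas as pd

df = pd.DataFrame({
    'DISTRICT': ['KARATU', 'GEITA', 'PANGANI'],
    'MALE': [23303.0, 115698.0, 4903.0],
    'FEMALE': [23181.0, 116328.0, 4925.0],
})

# Label of the row with the highest female enrollment.
top_label = df['FEMALE'].idxmax()

# Use that label to read the district name from the same row.
top_district = df.loc[top_label, 'DISTRICT']

# Label of the row with the lowest male enrollment.
low_label = df['MALE'].idxmin()
```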

`describe` produces multiple summary statistics in one shot:

In [None]:
data.describe()

To obtain descriptive statistics of a particular column use:

In [None]:
data['MALE'].mean()

<div class="alert alert-success">

<b>Activity 5</b>:

 <ul>
  <li>Select all rows for female students in the university dataset and calculate the mean number of female students enrolled in bachelor programs. Do the same for the male students.</li>
</ul>
</div>
