# Pandas Tutorial I
## Botswana 2014 General Election Results 

##### What is pandas? 
"pandas is an open source, BSD-licensed library providing high-performance, 
easy-to-use data structures and data analysis tools for the Python programming language." - https://pandas.pydata.org/pandas-docs/stable/overview.html

##### And our tutorial? 

This is an introductory tutorial. We will be using data from Botswana's 2014 General Elections. The data is available in 
an Excel spreadsheet that we will load into pandas for analysis.

Our approach will be very simple: we will introduce pandas features and functions as we work with our dataset. 


Have fun!

## 1. Loading data into pandas

Our data is in spreadsheet (.xlsx) format. Our approach is to:

- Import the pandas library
- Read the data into a pandas dataframe. 

In [1]:
#Import pandas
import pandas as pd

In [2]:
#Check pandas version
pd.__version__

'0.22.0'

## Pandas data structures

There are two main data structures in pandas.

**Series** - a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). 


**Dataframe** - a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. One can pass an **index** (row labels) and **columns** (column labels) along with the data.
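
To make the two structures concrete, here is a minimal sketch built from the two Chobe candidates in our data. The variable names `votes` and `chobe` are made up for illustration and are not used elsewhere in the tutorial.

```python
import pandas as pd

# A Series: one-dimensional, with an index of labels (here, party codes).
votes = pd.Series([4114, 3166], index=['BDP', 'BCP'], name='party_votes')

# A DataFrame: a dict of equal-length columns, each of which is a Series.
chobe = pd.DataFrame({
    'candidate_name': ['Ronald Machana Shamukuni', 'Gibson M.R Nshimwe'],
    'party_name': ['BDP', 'BCP'],
    'party_votes': [4114, 3166],
})

print(votes['BDP'])                # label-based lookup on a Series
print(type(chobe['party_votes']))  # selecting one column gives a Series
print(chobe.shape)                 # (rows, columns)
```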

We want to read tabular data into a dataframe. Pandas provides several I/O tools for this, including:
- read_csv()
- read_excel()
- read_table()
- read_sql()

Reference for I/O tools - https://pandas.pydata.org/pandas-docs/stable/io.html

Our data is in .xlsx format, so we will use the read_excel() function, passing in the file path as the only argument.

In [3]:
#Read data into a dataframe.
results_df = pd.read_excel('data/gen_elections_2014_master.xlsx')
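
The other readers listed above work the same way: they take a source and return a dataframe. The workbook itself cannot be reproduced here, so the sketch below uses read_csv() on a few rows of in-memory text instead. The text in `raw` is a hand-typed stand-in (an assumption for illustration), not the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the spreadsheet: three rows typed out as CSV text.
raw = io.StringIO(
    "CONSTITUENCY,CANDIDATE,PARTY,TOTAL REGISTERED,PARTY VOTES\n"
    "01 Chobe,Ronald Machana Shamukuni,BDP,8942,4114\n"
    ",Gibson M.R Nshimwe,BCP,,3166\n"
    "02 Maun East,Konstantinos Markus,BDP,16774,6046\n"
)

sample_df = pd.read_csv(raw)

# Blank cells are read in as NaN (missing values).
print(sample_df.head())
print(sample_df.isna().sum())
```

Because NaN is a floating point value, a numeric column that contains blanks ends up as float64 rather than int64.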

## Explore the data

To view a small sample of the results_df dataframe we can use the **head()** and **tail()** methods. By default they display five rows, but a custom number may be passed.

In [4]:
results_df.head(10)

Unnamed: 0,CONSTITUENCY,CANDIDATE,PARTY,TOTAL REGISTERED,CAST VOTES,REJECTS,VALID VOTES,PARTY VOTES
0,01 Chobe,Ronald Machana Shamukuni,BDP,8942.0,7354.0,74.0,7280.0,4114
1,,Gibson M.R Nshimwe,BCP,,,,,3166
2,02 Maun East,Konstantinos Markus,BDP,16774.0,13607.0,151.0,13456.0,6046
3,,Goretetse Kekgonegile,BCP,,,,,5304
4,,Osimilwe O. Fish,UDC,,,,,2062
5,,Simon Lethake,IND,,,,,44
6,03 Maun West,Tawana Moremi,UDC,18329.0,15100.0,137.0,14963.0,7271
7,,Reaboke Mbulawa,BDP,,,,,5335
8,,George Lubinda,BCP,,,,,2357
9,04 Ngami,Thato Kwerepe,BDP,18159.0,15055.0,175.0,14880.0,7063


In [5]:
results_df.tail()

Unnamed: 0,CONSTITUENCY,CANDIDATE,PARTY,TOTAL REGISTERED,CAST VOTES,REJECTS,VALID VOTES,PARTY VOTES
187,56 Ghanzi North,Noah Salakae,UDC,9156.0,7772.0,88.0,7684.0,3999
188,,Johnie Keemenao Swatz,BDP,,,,,3685
189,57 Ghanzi South,Christiaan De Graaff,BDP,10687.0,9206.0,67.0,9139.0,4812
190,,Motsamai G. Jelson Motsamai,UDC,,,,,3846
191,,Brains Kebogile Kwadipane,BCP,,,,,481
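
As a quick check of the default and custom row counts, here is a small sketch on a made-up 20-row dataframe (`numbers` is a hypothetical name used only here).

```python
import pandas as pd

numbers = pd.DataFrame({'n': range(1, 21)})  # 20 rows, values 1..20

first_five = numbers.head()     # default: first 5 rows
last_three = numbers.tail(3)    # custom count: last 3 rows

print(first_five['n'].tolist())
print(last_three['n'].tolist())
```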


Useful dataframe attributes:

**dtypes** - gives the dataframe column data types

**shape** - shows the dimensions of a dataframe

**columns** - gives the column names of a dataframe

In [6]:
results_df.dtypes

CONSTITUENCY         object
CANDIDATE            object
PARTY                object
TOTAL REGISTERED    float64
CAST VOTES          float64
REJECTS             float64
VALID VOTES         float64
PARTY VOTES           int64
dtype: object

In [7]:
results_df.shape

(192, 8)

In [8]:
results_df.columns

Index(['CONSTITUENCY', 'CANDIDATE', 'PARTY', 'TOTAL REGISTERED', 'CAST VOTES',
       'REJECTS', 'VALID VOTES', 'PARTY VOTES'],
      dtype='object')

Another useful method is **info()**, which prints a summary of our dataframe.

In [9]:
results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192 entries, 0 to 191
Data columns (total 8 columns):
CONSTITUENCY        57 non-null object
CANDIDATE           192 non-null object
PARTY               192 non-null object
TOTAL REGISTERED    57 non-null float64
CAST VOTES          57 non-null float64
REJECTS             57 non-null float64
VALID VOTES         57 non-null float64
PARTY VOTES         192 non-null int64
dtypes: float64(4), int64(1), object(3)
memory usage: 12.1+ KB


### Rename our dataframe columns

First create a Python list with the desired column names, in the same order as the existing columns.

In [10]:
#Create a list with desired column names
column_names = ['constituency_ref', 'candidate_name', 'party_name', 'registered_voters',
                'cast_votes', 'rejected_votes', 'valid_votes', 'party_votes']

Assign the new column names to the dataframe columns

In [11]:
#Rename columns
results_df.columns = column_names
results_df.columns

Index(['constituency_ref', 'candidate_name', 'party_name', 'registered_voters',
       'cast_votes', 'rejected_votes', 'valid_votes', 'party_votes'],
      dtype='object')

In [12]:
#OR use rename(), which returns a new dataframe, and assign the result back
#results_df = results_df.rename(columns={'CONSTITUENCY':'constituency_ref',
#                                        'CANDIDATE':'candidate_name',
#                                        'PARTY':'party_name',
#                                        'TOTAL REGISTERED':'registered_voters',
#                                        'CAST VOTES':'cast_votes',
#                                        'REJECTS':'rejected_votes',
#                                        'VALID VOTES':'valid_votes',
#                                        'PARTY VOTES':'party_votes'})
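
The two renaming approaches differ in one respect worth seeing: assigning to **columns** changes the dataframe directly, while **rename()** leaves it alone and returns a new one. A minimal sketch on a made-up two-column dataframe (`df` and `renamed` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'PARTY': ['BDP', 'BCP'], 'PARTY VOTES': [4114, 3166]})

# rename() returns a new dataframe; df itself keeps its old labels.
renamed = df.rename(columns={'PARTY': 'party_name', 'PARTY VOTES': 'party_votes'})

print(list(df.columns))
print(list(renamed.columns))
```

An advantage of the mapping form is that it does not depend on column order, and columns left out of the mapping keep their names.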

### RECAP

We have successfully loaded our data into a pandas dataframe. We need to get the data into a form that will simplify analysis. We will create two dataframes.

1. **candidate_votes** with columns - 'constituency_ref','candidate_name','party_name','party_votes'
2. **constituency_stats** with columns -'constituency_ref','registered_voters','cast_votes','rejected_votes','valid_votes'




## Creating dataframes from a dataframe

### Create candidate_votes dataframe

We just need to create a dataframe by subsetting the columns of **results_df**. We will first create a list with the desired columns and then use it to select the column subset we want.

In [13]:
desired_columns = ['constituency_ref','candidate_name','party_name','party_votes']
candidate_votes = results_df[desired_columns]
candidate_votes.head()

Unnamed: 0,constituency_ref,candidate_name,party_name,party_votes
0,01 Chobe,Ronald Machana Shamukuni,BDP,4114
1,,Gibson M.R Nshimwe,BCP,3166
2,02 Maun East,Konstantinos Markus,BDP,6046
3,,Goretetse Kekgonegile,BCP,5304
4,,Osimilwe O. Fish,UDC,2062


In [14]:
candidate_votes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192 entries, 0 to 191
Data columns (total 4 columns):
constituency_ref    57 non-null object
candidate_name      192 non-null object
party_name          192 non-null object
party_votes         192 non-null int64
dtypes: int64(1), object(3)
memory usage: 6.1+ KB


#### CONCERNS

1. Missing values in the 'constituency_ref' column. **How many?** 

We will address this issue later.
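
The **info()** output already answers the question: 57 of the 192 entries are non-null, so 192 - 57 = 135 values are missing. The count can also be computed directly. The sketch below does this on a hand-built stand-in for the first five rows (`sample` is a hypothetical name), so its numbers are for those five rows only.

```python
import numpy as np
import pandas as pd

# Small stand-in for candidate_votes: the first five rows shown above.
sample = pd.DataFrame({
    'constituency_ref': ['01 Chobe', np.nan, '02 Maun East', np.nan, np.nan],
    'party_votes': [4114, 3166, 6046, 5304, 2062],
})

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```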


### Create **constituency_stats** dataframe

We will follow the same approach used for the **candidate_votes** dataframe. This time, however, we will subset directly using a list of desired column names.

In [15]:
constituency_stats = results_df[['constituency_ref','registered_voters',
                                        'cast_votes','rejected_votes','valid_votes']]
constituency_stats.head(10)

Unnamed: 0,constituency_ref,registered_voters,cast_votes,rejected_votes,valid_votes
0,01 Chobe,8942.0,7354.0,74.0,7280.0
1,,,,,
2,02 Maun East,16774.0,13607.0,151.0,13456.0
3,,,,,
4,,,,,
5,,,,,
6,03 Maun West,18329.0,15100.0,137.0,14963.0
7,,,,,
8,,,,,
9,04 Ngami,18159.0,15055.0,175.0,14880.0


In [16]:
constituency_stats.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192 entries, 0 to 191
Data columns (total 5 columns):
constituency_ref     57 non-null object
registered_voters    57 non-null float64
cast_votes           57 non-null float64
rejected_votes       57 non-null float64
valid_votes          57 non-null float64
dtypes: float64(4), object(1)
memory usage: 7.6+ KB


#### CONCERNS

1. Too many missing values: every column has only 57 non-null entries out of 192.
2. All numeric columns should hold integers.

## Handling missing values

The dataframes we have created have columns with missing values.

### Cleaning the candidate_votes dataframe

1. Replace missing values with constituency names (fill forward).
2. Create a new column **constituency_name** from the 'constituency_ref' values. This will hold constituency names ONLY, without the leading numbers.
3. Drop the **constituency_ref** column.
4. Re-order the column labels.

In [17]:
#Just to recap
candidate_votes.head()

Unnamed: 0,constituency_ref,candidate_name,party_name,party_votes
0,01 Chobe,Ronald Machana Shamukuni,BDP,4114
1,,Gibson M.R Nshimwe,BCP,3166
2,02 Maun East,Konstantinos Markus,BDP,6046
3,,Goretetse Kekgonegile,BCP,5304
4,,Osimilwe O. Fish,UDC,2062


**Task:** Replacing the missing values in the **constituency_ref** column requires understanding two ideas:

- How to select data by column name.
- How to fill in missing values.

**Select data by column name**

In [18]:
candidate_votes['constituency_ref'].head(10)

0        01 Chobe
1             NaN
2    02 Maun East
3             NaN
4             NaN
5             NaN
6    03 Maun West
7             NaN
8             NaN
9        04 Ngami
Name: constituency_ref, dtype: object

To fill in missing values we use the **fillna()** method on the column (actually a series!) of interest.

In [19]:
candidate_votes['constituency_ref'].fillna(method='ffill').head(20)

0         01 Chobe
1         01 Chobe
2     02 Maun East
3     02 Maun East
4     02 Maun East
5     02 Maun East
6     03 Maun West
7     03 Maun West
8     03 Maun West
9         04 Ngami
10        04 Ngami
11        04 Ngami
12     05 Okavango
13     05 Okavango
14     05 Okavango
15    06 Tati East
16    06 Tati East
17    06 Tati East
18    07 Tati West
19    07 Tati West
Name: constituency_ref, dtype: object
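
Pandas also offers **ffill()** as a shorthand for a forward fill. This may be the safer spelling going forward, since more recent pandas releases deprecate the **method** argument of **fillna()**. A minimal sketch on a made-up series (`refs` is an illustrative name):

```python
import numpy as np
import pandas as pd

refs = pd.Series(['01 Chobe', np.nan, '02 Maun East', np.nan, np.nan])

filled = refs.ffill()   # propagate the last seen value downwards

print(filled.tolist())
print(refs.isna().sum())   # the original series still has its gaps
```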

In [20]:
candidate_votes.head()

Unnamed: 0,constituency_ref,candidate_name,party_name,party_votes
0,01 Chobe,Ronald Machana Shamukuni,BDP,4114
1,,Gibson M.R Nshimwe,BCP,3166
2,02 Maun East,Konstantinos Markus,BDP,6046
3,,Goretetse Kekgonegile,BCP,5304
4,,Osimilwe O. Fish,UDC,2062


**What has happened?** Although the series we displayed had its values filled forward, our dataframe remains unchanged.

**Is this a bug?** NO. **fillna()** returns a new series rather than modifying the dataframe, so we need to assign the result back to the column.

In [21]:
candidate_votes['constituency_ref'] = candidate_votes['constituency_ref'].fillna(method='ffill')
candidate_votes.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,constituency_ref,candidate_name,party_name,party_votes
0,01 Chobe,Ronald Machana Shamukuni,BDP,4114
1,01 Chobe,Gibson M.R Nshimwe,BCP,3166
2,02 Maun East,Konstantinos Markus,BDP,6046
3,02 Maun East,Goretetse Kekgonegile,BCP,5304
4,02 Maun East,Osimilwe O. Fish,UDC,2062
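
The warning printed above is pandas cautioning us that **candidate_votes** was created by selecting columns from **results_df**, so it cannot tell whether we meant to modify the subset or the original. One common way to avoid it is to take an explicit copy when creating the subset. The sketch below shows the idea on a small hand-built dataframe; `results` and `subset` are illustrative names, not the tutorial's variables.

```python
import numpy as np
import pandas as pd

results = pd.DataFrame({
    'constituency_ref': ['01 Chobe', np.nan, '02 Maun East'],
    'party_name': ['BDP', 'BCP', 'BDP'],
    'party_votes': [4114, 3166, 6046],
    'valid_votes': [7280.0, np.nan, 13456.0],
})

# .copy() makes the subset an independent dataframe, so later
# assignments are unambiguous and do not touch `results`.
subset = results[['constituency_ref', 'party_name', 'party_votes']].copy()
subset['constituency_ref'] = subset['constituency_ref'].ffill()

print(subset['constituency_ref'].tolist())
print(results['constituency_ref'].isna().sum())   # original is untouched
```

In this tutorial the same effect would come from adding `.copy()` where **candidate_votes** and **constituency_stats** are first created.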


Have we changed our original dataframe?

In [22]:
results_df.head(10)

Unnamed: 0,constituency_ref,candidate_name,party_name,registered_voters,cast_votes,rejected_votes,valid_votes,party_votes
0,01 Chobe,Ronald Machana Shamukuni,BDP,8942.0,7354.0,74.0,7280.0,4114
1,,Gibson M.R Nshimwe,BCP,,,,,3166
2,02 Maun East,Konstantinos Markus,BDP,16774.0,13607.0,151.0,13456.0,6046
3,,Goretetse Kekgonegile,BCP,,,,,5304
4,,Osimilwe O. Fish,UDC,,,,,2062
5,,Simon Lethake,IND,,,,,44
6,03 Maun West,Tawana Moremi,UDC,18329.0,15100.0,137.0,14963.0,7271
7,,Reaboke Mbulawa,BDP,,,,,5335
8,,George Lubinda,BCP,,,,,2357
9,04 Ngami,Thato Kwerepe,BDP,18159.0,15055.0,175.0,14880.0,7063


Are all null values filled? Use the **info()** method to confirm.

In [23]:
candidate_votes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 192 entries, 0 to 191
Data columns (total 4 columns):
constituency_ref    192 non-null object
candidate_name      192 non-null object
party_name          192 non-null object
party_votes         192 non-null int64
dtypes: int64(1), object(3)
memory usage: 6.1+ KB


**Task:** Create a new column constituency_name from the 'constituency_ref' values. This will hold constituency names ONLY, without the leading numbers.

We can apply a function to each value in a dataframe column by using the **apply()** method. We will also introduce the lambda operator for creating anonymous functions in Python. What we will do is very simple: every reference starts with a two-digit number and a space, so we slice each string from the fourth character onwards (x[3:]). Check out string slicing in Python if this is unfamiliar.

In [24]:
#Create 'constituency_name' column without leading numbers
candidate_votes['constituency_name'] = candidate_votes['constituency_ref'].apply(lambda x: x[3:])
candidate_votes.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


Unnamed: 0,constituency_ref,candidate_name,party_name,party_votes,constituency_name
0,01 Chobe,Ronald Machana Shamukuni,BDP,4114,Chobe
1,01 Chobe,Gibson M.R Nshimwe,BCP,3166,Chobe
2,02 Maun East,Konstantinos Markus,BDP,6046,Maun East
3,02 Maun East,Goretetse Kekgonegile,BCP,5304,Maun East
4,02 Maun East,Osimilwe O. Fish,UDC,2062,Maun East
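
For string columns, pandas also provides the **str** accessor, which applies string operations to every value without an explicit lambda. The sketch below compares it with the **apply()** approach on a made-up series (`refs` and the `by_*` names are illustrative).

```python
import pandas as pd

refs = pd.Series(['01 Chobe', '02 Maun East', '56 Ghanzi North'])

by_slice = refs.apply(lambda x: x[3:])        # the approach used above
by_str = refs.str[3:]                         # same slice via the str accessor
by_split = refs.str.split(' ', n=1).str[1]    # split once on the first space

print(by_split.tolist())
```

The fixed slice relies on every prefix being exactly two digits plus a space, which holds for our 57 constituencies. Splitting on the first space would also cope with longer prefixes.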


In [25]:
candidate_votes.head()

Unnamed: 0,constituency_ref,candidate_name,party_name,party_votes,constituency_name
0,01 Chobe,Ronald Machana Shamukuni,BDP,4114,Chobe
1,01 Chobe,Gibson M.R Nshimwe,BCP,3166,Chobe
2,02 Maun East,Konstantinos Markus,BDP,6046,Maun East
3,02 Maun East,Goretetse Kekgonegile,BCP,5304,Maun East
4,02 Maun East,Osimilwe O. Fish,UDC,2062,Maun East


In [26]:
candidate_votes.tail()

Unnamed: 0,constituency_ref,candidate_name,party_name,party_votes,constituency_name
187,56 Ghanzi North,Noah Salakae,UDC,3999,Ghanzi North
188,56 Ghanzi North,Johnie Keemenao Swatz,BDP,3685,Ghanzi North
189,57 Ghanzi South,Christiaan De Graaff,BDP,4812,Ghanzi South
190,57 Ghanzi South,Motsamai G. Jelson Motsamai,UDC,3846,Ghanzi South
191,57 Ghanzi South,Brains Kebogile Kwadipane,BCP,481,Ghanzi South


**Task:** Drop the constituency_ref column

Use the **drop()** method

In [27]:
candidate_votes.drop('constituency_ref', axis=1)

Unnamed: 0,candidate_name,party_name,party_votes,constituency_name
0,Ronald Machana Shamukuni,BDP,4114,Chobe
1,Gibson M.R Nshimwe,BCP,3166,Chobe
2,Konstantinos Markus,BDP,6046,Maun East
3,Goretetse Kekgonegile,BCP,5304,Maun East
4,Osimilwe O. Fish,UDC,2062,Maun East
5,Simon Lethake,IND,44,Maun East
6,Tawana Moremi,UDC,7271,Maun West
7,Reaboke Mbulawa,BDP,5335,Maun West
8,George Lubinda,BCP,2357,Maun West
9,Thato Kwerepe,BDP,7063,Ngami


In [28]:
candidate_votes.head()

Unnamed: 0,constituency_ref,candidate_name,party_name,party_votes,constituency_name
0,01 Chobe,Ronald Machana Shamukuni,BDP,4114,Chobe
1,01 Chobe,Gibson M.R Nshimwe,BCP,3166,Chobe
2,02 Maun East,Konstantinos Markus,BDP,6046,Maun East
3,02 Maun East,Goretetse Kekgonegile,BCP,5304,Maun East
4,02 Maun East,Osimilwe O. Fish,UDC,2062,Maun East


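**What happened?** **drop()** returned a new dataframe without the column, which is the output we saw, but it did not change **candidate_votes** itself. Set **inplace** = **True** in the **drop()** method to modify the dataframe directly. 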

In [29]:
candidate_votes.drop('constituency_ref', axis=1, inplace=True)
candidate_votes.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,candidate_name,party_name,party_votes,constituency_name
0,Ronald Machana Shamukuni,BDP,4114,Chobe
1,Gibson M.R Nshimwe,BCP,3166,Chobe
2,Konstantinos Markus,BDP,6046,Maun East
3,Goretetse Kekgonegile,BCP,5304,Maun East
4,Osimilwe O. Fish,UDC,2062,Maun East
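
An alternative to **inplace=True** is to assign the result of **drop()** back to the same name. A minimal sketch on a made-up dataframe (`df` and `trimmed` are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'constituency_ref': ['01 Chobe', '01 Chobe'],
    'constituency_name': ['Chobe', 'Chobe'],
    'party_votes': [4114, 3166],
})

# Without inplace=True, drop() leaves df alone and returns a new dataframe.
trimmed = df.drop('constituency_ref', axis=1)
print(list(df.columns))
print(list(trimmed.columns))

# Reassigning the name gives the same end result as inplace=True.
df = df.drop('constituency_ref', axis=1)
print(list(df.columns))
```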


In [30]:
candidate_votes.columns

Index(['candidate_name', 'party_name', 'party_votes', 'constituency_name'], dtype='object')

**Task:** Re-order column labels

In [31]:
candidate_votes = candidate_votes[['constituency_name', 'candidate_name',
                                   'party_name', 'party_votes']]
candidate_votes.head()

Unnamed: 0,constituency_name,candidate_name,party_name,party_votes
0,Chobe,Ronald Machana Shamukuni,BDP,4114
1,Chobe,Gibson M.R Nshimwe,BCP,3166
2,Maun East,Konstantinos Markus,BDP,6046
3,Maun East,Goretetse Kekgonegile,BCP,5304
4,Maun East,Osimilwe O. Fish,UDC,2062


### Cleaning the constituency_stats dataframe

1. Drop all rows with missing values.
2. Convert all numeric columns to integers.
3. Drop the leading numbers from the 'constituency_ref' column values.
4. Rename the 'constituency_ref' column.

In [32]:
constituency_stats.head()

Unnamed: 0,constituency_ref,registered_voters,cast_votes,rejected_votes,valid_votes
0,01 Chobe,8942.0,7354.0,74.0,7280.0
1,,,,,
2,02 Maun East,16774.0,13607.0,151.0,13456.0
3,,,,,
4,,,,,


**Task:** Drop all rows with missing values.

In [33]:
constituency_stats.dropna(axis=0, how='all', inplace=True)
constituency_stats.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,constituency_ref,registered_voters,cast_votes,rejected_votes,valid_votes
0,01 Chobe,8942.0,7354.0,74.0,7280.0
2,02 Maun East,16774.0,13607.0,151.0,13456.0
6,03 Maun West,18329.0,15100.0,137.0,14963.0
9,04 Ngami,18159.0,15055.0,175.0,14880.0
12,05 Okavango,15243.0,12726.0,174.0,12552.0
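
The **how** argument controls which rows **dropna()** removes: 'all' drops a row only when every value in it is missing, while 'any' drops a row with at least one missing value. In our data the incomplete rows are entirely empty, so either setting would likely give the same 57 rows. The sketch below uses a made-up dataframe, with one value blanked on purpose, to show the difference.

```python
import numpy as np
import pandas as pd

stats = pd.DataFrame({
    'constituency_ref': ['01 Chobe', np.nan, '02 Maun East'],
    'registered_voters': [8942.0, np.nan, 16774.0],
    'cast_votes': [7354.0, np.nan, np.nan],   # last value blanked for illustration
})

all_missing_dropped = stats.dropna(axis=0, how='all')   # drops only row 1
any_missing_dropped = stats.dropna(axis=0, how='any')   # drops rows 1 and 2

print(all_missing_dropped.index.tolist())
print(any_missing_dropped.index.tolist())
```

Note that the surviving rows keep their original index labels, which is why our cleaned dataframe is indexed 0, 2, 6, 9, 12 and so on. Calling reset_index(drop=True) would renumber them from 0 if that were preferred.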


In [34]:
constituency_stats.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 57 entries, 0 to 189
Data columns (total 5 columns):
constituency_ref     57 non-null object
registered_voters    57 non-null float64
cast_votes           57 non-null float64
rejected_votes       57 non-null float64
valid_votes          57 non-null float64
dtypes: float64(4), object(1)
memory usage: 2.7+ KB


In [35]:
constituency_stats.columns

Index(['constituency_ref', 'registered_voters', 'cast_votes', 'rejected_votes',
       'valid_votes'],
      dtype='object')

**Task:** Convert all numerics to integers.

In [36]:
numeric_columns = ['registered_voters', 'cast_votes', 'rejected_votes', 'valid_votes']
for col in numeric_columns:
    constituency_stats[col] = constituency_stats[col].astype(int)
constituency_stats.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


Unnamed: 0,constituency_ref,registered_voters,cast_votes,rejected_votes,valid_votes
0,01 Chobe,8942,7354,74,7280
2,02 Maun East,16774,13607,151,13456
6,03 Maun West,18329,15100,137,14963
9,04 Ngami,18159,15055,175,14880
12,05 Okavango,15243,12726,174,12552
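
The loop above can also be written as a single **astype()** call that takes a mapping from column name to type. This is a sketch of that variant on a made-up dataframe. Either way the conversion only works once the missing values are gone, because NaN cannot be stored in an integer column, which is why we dropped the empty rows first.

```python
import pandas as pd

stats = pd.DataFrame({
    'constituency_ref': ['01 Chobe', '02 Maun East'],
    'registered_voters': [8942.0, 16774.0],
    'cast_votes': [7354.0, 13607.0],
})

numeric_columns = ['registered_voters', 'cast_votes']

# Convert several columns at once by passing a {column: dtype} mapping.
stats = stats.astype({col: 'int64' for col in numeric_columns})

print(stats.dtypes)
```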


**Task:** Drop the leading numbers from the **constituency_ref** column values

In [37]:
new_column = constituency_stats['constituency_ref'].apply(lambda x: x[3:])
new_column.head()

0         Chobe
2     Maun East
6     Maun West
9         Ngami
12     Okavango
Name: constituency_ref, dtype: object

In [38]:
constituency_stats = constituency_stats.assign(constituency_ref = new_column)
constituency_stats.head()

Unnamed: 0,constituency_ref,registered_voters,cast_votes,rejected_votes,valid_votes
0,Chobe,8942,7354,74,7280
2,Maun East,16774,13607,151,13456
6,Maun West,18329,15100,137,14963
9,Ngami,18159,15055,175,14880
12,Okavango,15243,12726,174,12552
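
**assign()** returns a new dataframe with the given column added or replaced, which is why we assigned its result back to **constituency_stats**. It also accepts a function in place of a ready-made column, so the two steps above could be combined. A minimal sketch on a made-up dataframe (`stats` and `cleaned` are illustrative names):

```python
import pandas as pd

stats = pd.DataFrame({'constituency_ref': ['01 Chobe', '02 Maun East'],
                      'cast_votes': [7354, 13607]})

# assign() returns a new dataframe; a callable receives the dataframe itself.
cleaned = stats.assign(constituency_ref=lambda d: d['constituency_ref'].str[3:])

print(cleaned['constituency_ref'].tolist())
print(stats['constituency_ref'].tolist())   # the original is unchanged
```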


**Task:** Rename **constituency_ref** to **constituency_name** 

In [39]:
constituency_stats.rename(columns={'constituency_ref':'constituency_name'}, inplace=True)
constituency_stats.head()

Unnamed: 0,constituency_name,registered_voters,cast_votes,rejected_votes,valid_votes
0,Chobe,8942,7354,74,7280
2,Maun East,16774,13607,151,13456
6,Maun West,18329,15100,137,14963
9,Ngami,18159,15055,175,14880
12,Okavango,15243,12726,174,12552


In [40]:
constituency_stats.columns

Index(['constituency_name', 'registered_voters', 'cast_votes',
       'rejected_votes', 'valid_votes'],
      dtype='object')

### RECAP

We have created two clean dataframes: **candidate_votes** and **constituency_stats**.

We will use these to answer questions of interest about the election results in **Part II** of the tutorial. Our final task is to save them to disk as **csv** files.


**Task:** Save dataframes to **csv** files on disk.

Use the **to_csv()** method

In [41]:
candidate_votes.to_csv('data/candidate_votes.csv',index=False)
constituency_stats.to_csv('data/constituency_stats.csv',index=False)
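
We pass **index=False** so that the row labels are not written out as an extra column. To see the effect without touching the disk, the sketch below writes a made-up dataframe to an in-memory buffer and reads it back (`buffer`, `csv_text` and `reloaded` are illustrative names).

```python
import io
import pandas as pd

stats = pd.DataFrame({
    'constituency_name': ['Chobe', 'Maun East'],
    'cast_votes': [7354, 13607],
}, index=[0, 2])   # non-consecutive labels, as left behind by dropna()

buffer = io.StringIO()
stats.to_csv(buffer, index=False)   # index=False leaves the row labels out
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives a fresh 0-based index.
reloaded = pd.read_csv(io.StringIO(csv_text))
print(reloaded.index.tolist())
```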

In [42]:
ls

candidate_votes.csv            pandas_cheat_sheet.png
census_pop_2011.xlsx           pandas_tutorial_2014_elections_1.ipynb
constituency_stats.csv         pandas_tutorial_2014_elections_II.ipynb
data/                          pandas_tutorial_elections.py
elections_stats.csv            party_votes.csv
gen_elections_2014_master.ods  Untitled.ipynb
