# Pandas Tutorial I
## Botswana 2014 General Election Results 

##### What is pandas? 
"pandas is an open source, BSD-licensed library providing high-performance, 
easy-to-use data structures and data analysis tools for the Python programming language." - https://pandas.pydata.org/pandas-docs/stable/overview.html

##### And our tutorial? 

This is an introductory tutorial. We will be using data from Botswana's 2014 General Elections. The data is available in 
an Excel spreadsheet that we will load into pandas for analysis.

Our approach will be very simple. We will introduce pandas features and functions as we work with our dataset. 


Have fun!

## 1. Loading data into pandas

Our data is in spreadsheet (.xlsx) format. Our approach is to:

- Import the pandas library.
- Read the data into a pandas dataframe.

In [None]:
#Import pandas
import pandas as pd

In [None]:
#Check pandas version
pd.__version__

## Pandas data structures

There are two main data structures in pandas.

**Series** - a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). 


**Dataframe** - a two-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. One can pass an **index** (row labels) and **columns** (column labels) along with the data.
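
As a minimal sketch of the two structures, the snippet below builds a Series and a dataframe by hand. The party labels, row labels and vote counts are made up for illustration and are not taken from the election data.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels (made-up numbers).
votes = pd.Series([120, 85, 40], index=['party_a', 'party_b', 'party_c'])

# A dataframe built from a dict of columns, with explicit row labels.
toy_df = pd.DataFrame(
    {'party_name': ['party_a', 'party_b'], 'party_votes': [120, 85]},
    index=['row_1', 'row_2'],
)

print(votes['party_b'])                    # look up a value by its label
print(type(toy_df['party_votes']))         # a single column is a Series
print(toy_df.loc['row_2', 'party_votes'])  # look up by row and column label
```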

We want to read tabular data into a dataframe. There are different I/O tools for reading data into dataframes:

- read_csv()
- read_excel()
- read_table()
- read_sql(), etc.

Reference for I/O tools - https://pandas.pydata.org/pandas-docs/stable/io.html
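
The reader functions are all called in a similar way: give them a source and they return a dataframe. Here is a small sketch using read_csv() on in-memory text, so that it runs without any file on disk. The column names and values are made up. read_excel() is called the same way with a path to a spreadsheet, and for .xlsx files it needs the openpyxl package installed.

```python
import io

import pandas as pd

# Made-up CSV text standing in for a file on disk.
csv_text = "PARTY,PARTY VOTES\nparty_a,120\nparty_b,85\n"

# read_csv() accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))
print(sample_df)
```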

Our data is in .xlsx format, so we will use the read_excel() function, passing in the filename as the only argument.

In [None]:
#Read data into a dataframe.
results_df = pd.read_excel('data/gen_elections_2014_master.xlsx')

## Explore the data

To view a small sample of the results_df dataframe we can use the **head()** and **tail()** methods. By default they display five rows, but a custom number may be passed.

In [None]:
results_df.head(10)

In [None]:
results_df.tail()

Useful dataframe attributes:

**dtypes** - gives the dataframe column data types

**shape** - shows the dimensions of a dataframe

**columns** - gives the column names of a dataframe
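
A quick sketch of these attributes on a small made-up dataframe (not the election data). Note that they are attributes rather than methods, so they are written without parentheses.

```python
import pandas as pd

# Toy dataframe with made-up values.
toy_df = pd.DataFrame({
    'party_name': ['party_a', 'party_b', 'party_c'],
    'party_votes': [120, 85, 40],
})

print(toy_df.shape)          # (number of rows, number of columns)
print(list(toy_df.columns))  # column labels
print(toy_df.dtypes)         # one data type per column
```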

In [None]:
results_df.dtypes

In [None]:
results_df.shape

In [None]:
results_df.columns

Another useful method is **info()**, which prints a summary of our dataframe: the column names, the number of non-null entries in each column, and the data types.

In [None]:
results_df.info()

### Rename our dataframe columns

First create a Python list with the desired column names, in the same order as the existing columns.

In [None]:
#Create a list with desired column names
column_names = ['constituency_ref', 'candidate_name', 'party_name', 'registered_voters',
                'cast_votes', 'rejected_votes', 'valid_votes', 'party_votes']

Assign the new column names to the dataframe columns

In [None]:
#Rename columns
results_df.columns = column_names
results_df.columns

In [None]:
#OR
#results_df.rename(columns={'CONSTITUENCY':'constituency_ref', 
#                            'CANDIDATE':'candidate_name', 
#                            'PARTY':'party_name', 
#                            'TOTAL REGISTERED':'registered_voters',
#                            'CAST VOTES':'cast_votes', 
#                            'REJECTS':'rejected_votes', 
#                            'VALID VOTES':'valid_votes', 
#                            'PARTY VOTES':'party_votes'})
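
The two renaming approaches behave slightly differently, which the sketch below tries to show on a made-up two-column dataframe. rename() returns a new dataframe and leaves the original alone unless the result is assigned back, while assigning to .columns changes the dataframe directly and needs one name per existing column.

```python
import pandas as pd

# Toy dataframe with made-up values.
toy_df = pd.DataFrame({'PARTY': ['party_a'], 'PARTY VOTES': [120]})

# Option 1: rename() maps old names to new ones and returns a new dataframe.
renamed = toy_df.rename(columns={'PARTY': 'party_name',
                                 'PARTY VOTES': 'party_votes'})
print(list(toy_df.columns))   # the original names are still here
print(list(renamed.columns))

# Option 2: assign a full list of names; its length must match the column count.
toy_df.columns = ['party_name', 'party_votes']
print(list(toy_df.columns))
```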

### RECAP

We have successfully loaded our data into a pandas dataframe. We need to get the data into a form that will simplify analysis. We will create two dataframes.

1. **candidate_votes** with columns - 'constituency_ref','candidate_name','party_name','party_votes'
2. **constituency_stats** with columns -'constituency_ref','registered_voters','cast_votes','rejected_votes','valid_votes'




## Creating dataframes from a dataframe

### Create candidate_votes dataframe

We just need to create a dataframe by subsetting the columns of **results_df**. We will first create a list with the desired columns and then use it to select the column subset we want.

In [None]:
desired_columns = ['constituency_ref','candidate_name','party_name','party_votes']
# .copy() gives an independent dataframe, so later edits avoid SettingWithCopyWarning
candidate_votes = results_df[desired_columns].copy()
candidate_votes.head()

In [None]:
candidate_votes.info()

#### CONCERNS

1. Missing values in the 'constituency_ref' column. **How many?** 

We address the issue later.
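
One way the "How many?" question could be answered is by chaining isna() and sum(), which counts the missing values in each column. The sketch below uses a made-up dataframe; the reference strings are hypothetical and only imitate the pattern of one label followed by gaps.

```python
import numpy as np
import pandas as pd

# Toy data: a label appears once, then the rows below it are left empty.
toy_votes = pd.DataFrame({
    'constituency_ref': ['1. Place A', np.nan, np.nan, '2. Place B', np.nan],
    'party_votes': [120, 85, 40, 200, 150],
})

# isna() marks missing cells as True; sum() counts the True values per column.
missing_per_column = toy_votes.isna().sum()
print(missing_per_column)
```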


### Create **constituency_stats** dataframe

We will repeat the approach we used for creating the **candidate_votes** dataframe. This time, however, we will subset directly using a list of desired column names.

In [None]:
# .copy() gives an independent dataframe, so the in-place cleaning later on is safe
constituency_stats = results_df[['constituency_ref','registered_voters',
                                 'cast_votes','rejected_votes','valid_votes']].copy()
constituency_stats.head(10)

In [None]:
constituency_stats.info()

#### CONCERNS

1. Too many missing values in all the columns. There are only 57 non-null entries in each column.
2. All numeric types should be integers.

## Handling missing values

The dataframes we have created have columns with missing values.

### Cleaning the candidate_votes dataframe

1. Replace missing values in **constituency_ref** with constituency names by filling forward.
2. Create a new column **constituency_name** from the 'constituency_ref' values. It will hold constituency names ONLY, without the leading numbers.
3. Drop the **constituency_ref** column.
4. Re-order the column labels.

In [None]:
#Just to recap
candidate_votes.head()

**Task:** Replacing the missing values in the **constituency_ref** column requires understanding two ideas:

- How to select data by column name.
- How to fill in missing values.

**Select data by column name**

In [None]:
candidate_votes['constituency_ref'].head(10)

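To fill in missing values we use the **fillna()** method on the column (actually a series!) of interest. For filling forward specifically, the **ffill()** method does the same job as `fillna(method='ffill')`, a spelling that recent pandas versions deprecate.

In [None]:
candidate_votes['constituency_ref'].ffill().head(20)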

In [None]:
candidate_votes.head()

**What has happened?** Our series showed the values filled forward, yet the values in our dataframe remain unchanged.

**Is this a bug?** NO. We need to assign the modified values to the dataframe.
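
The same behaviour can be reproduced on a tiny made-up series, as sketched below: the fill returns a new Series, and the dataframe only changes once that result is assigned back. The reference strings are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy column with gaps below each label.
toy = pd.DataFrame({
    'constituency_ref': ['1. Place A', np.nan, '2. Place B', np.nan],
})

filled = toy['constituency_ref'].ffill()     # returns a new Series
print(toy['constituency_ref'].isna().sum())  # 2: the dataframe still has gaps

toy['constituency_ref'] = filled             # assign back to keep the change
print(toy['constituency_ref'].tolist())
```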

In [None]:
# ffill() is the current spelling of fillna(method='ffill'), which is deprecated
candidate_votes['constituency_ref'] = candidate_votes['constituency_ref'].ffill()
candidate_votes.head()

Have we changed our original dataframe?

In [None]:
results_df.head(10)

Are all null values filled? Use the **info()** method to confirm.

In [None]:
candidate_votes.info()

**Task:** Create a new column constituency_name from the 'constituency_ref' values. It will hold constituency names ONLY, without the leading numbers.

We can apply functions to the columns of a dataframe by using the **apply()** method. We will also introduce the lambda keyword for creating anonymous functions in Python. What we will do is very simple: we will slice the strings to pick the characters we want. Check out string slicing in Python if confused.
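
Here is a small sketch of the slicing idea on made-up reference strings. The format of the real 'constituency_ref' values is not shown in this tutorial, so the strings below are an assumption. Slicing with a fixed offset only works when every prefix has the same width; if the widths vary, a regular expression passed to str.replace() is one possible alternative.

```python
import pandas as pd

# Hypothetical reference strings; the second has a wider numeric prefix.
refs = pd.Series(['1. Place A', '10. Place J'])

# apply() calls the lambda once per value; x[3:] drops the first 3 characters.
names_by_slice = refs.apply(lambda x: x[3:])
print(names_by_slice.tolist())   # ['Place A', ' Place J']

# The .str accessor offers the same slice without a lambda.
names_by_str = refs.str[3:]

# A regex removes leading digits, a dot and spaces, whatever their width.
names_by_regex = refs.str.replace(r'^\d+\.\s*', '', regex=True)
print(names_by_regex.tolist())   # ['Place A', 'Place J']
```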

In [None]:
#Create 'constituency_name' column without leading numbers
candidate_votes['constituency_name'] = candidate_votes['constituency_ref'].apply(lambda x: x[3:])
candidate_votes.head()

In [None]:
candidate_votes.head()

In [None]:
candidate_votes.tail()

**Task:** Drop the constituency_ref column

Use the **drop()** method

In [None]:
candidate_votes.drop('constituency_ref', axis=1)

In [None]:
candidate_votes.head()

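**What happened?** We got the view we wanted, but we didn't change the dataframe: **drop()** returns a new dataframe and leaves the original untouched. To modify the dataframe itself, set **inplace** = **True** in the **drop()** method. 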
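
A short sketch on a made-up dataframe shows the same effect, together with reassignment as an alternative to inplace=True. Either style ends with the column gone; which one to use is mostly a matter of preference.

```python
import pandas as pd

# Toy dataframe with made-up values.
toy = pd.DataFrame({'constituency_ref': ['1. Place A'], 'party_votes': [120]})

dropped = toy.drop('constituency_ref', axis=1)  # returns a new dataframe
print(list(toy.columns))      # the original still has both columns
print(list(dropped.columns))

# Alternative to inplace=True: assign the result back to the same name.
toy = toy.drop(columns='constituency_ref')
print(list(toy.columns))
```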

In [None]:
candidate_votes.drop('constituency_ref', axis=1, inplace=True)
candidate_votes.head()

In [None]:
candidate_votes.columns

**Task:** Re-order the column labels

In [None]:
candidate_votes = candidate_votes[[ 'constituency_name','candidate_name',
 'party_name','party_votes']]
candidate_votes.head()

### Cleaning the constituency_stats dataframe

1. Drop all rows with missing values.
2. Convert all numerics to integers.
3. Drop the leading numbers from the 'constituency_ref' column values
4. Rename the 'constituency_ref' column

In [None]:
constituency_stats.head()

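**Task:** Drop all rows with missing values. We pass `how='all'` to **dropna()** so that only rows in which every column is missing are removed.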
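
The `how` argument of dropna() decides how strict the drop is. The sketch below, on made-up data, compares how='all' (drop a row only if every value is missing) with how='any' (drop a row if at least one value is missing).

```python
import numpy as np
import pandas as pd

# Toy data: row 0 is complete, row 1 is empty, row 2 is partly filled.
toy_stats = pd.DataFrame({
    'constituency_ref': ['1. Place A', np.nan, '2. Place B'],
    'registered_voters': [1000.0, np.nan, np.nan],
})

all_missing_removed = toy_stats.dropna(axis=0, how='all')
any_missing_removed = toy_stats.dropna(axis=0, how='any')

print(len(all_missing_removed))  # 2: only the fully empty row is dropped
print(len(any_missing_removed))  # 1: the partly filled row is dropped too
```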

In [None]:
constituency_stats.dropna(axis=0, how='all', inplace=True)
constituency_stats.head()

In [None]:
constituency_stats.info()

In [None]:
constituency_stats.columns

**Task:** Convert all numerics to integers.
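
The order of the cleaning steps matters here. A column that contains missing values is stored as floats, and casting it to int fails while the missing values are still present. The sketch below illustrates this on made-up numbers.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (made-up numbers).
toy_stats = pd.DataFrame({'registered_voters': [1000.0, np.nan, 2500.0]})
print(toy_stats['registered_voters'].dtype)   # float64, because of the NaN

# Casting to int while NaN is present raises an error.
try:
    toy_stats['registered_voters'].astype(int)
except ValueError as err:
    print('cannot cast:', err)

# After dropping the missing rows the cast succeeds.
clean = toy_stats.dropna().copy()
clean['registered_voters'] = clean['registered_voters'].astype(int)
print(clean['registered_voters'].tolist())
```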

In [None]:
numeric_columns = ['registered_voters', 'cast_votes', 'rejected_votes', 'valid_votes']
for col in numeric_columns:
    constituency_stats[col] = constituency_stats[col].astype(int)
constituency_stats.head()

**Task:** Drop the leading numbers from the **constituency_ref** column values

In [None]:
new_column = constituency_stats['constituency_ref'].apply(lambda x: x[3:])
new_column.head()

In [None]:
constituency_stats = constituency_stats.assign(constituency_ref = new_column)
constituency_stats.head()

**Task:** Rename **constituency_ref** to **constituency_name** 

In [None]:
constituency_stats.rename(columns={'constituency_ref':'constituency_name'}, inplace=True)
constituency_stats.head()

In [None]:
constituency_stats.columns

### RECAP

We have created two clean dataframes: **candidate_votes** and **constituency_stats**.

We will use these to answer questions of interest about the election results in **Part II** of the tutorial. Our final task is to save them to disk as **csv** files.


**Task:** Save dataframes to **csv** files on disk.

Use **to_csv()** method
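
To see what `index=False` does without touching the disk, the sketch below writes a made-up dataframe to an in-memory buffer and reads it back. With `index=False` the row labels are left out, so the file starts directly with the column names.

```python
import io

import pandas as pd

# Toy dataframe with made-up values.
toy = pd.DataFrame({'party_name': ['party_a', 'party_b'],
                    'party_votes': [120, 85]})

# to_csv() accepts a path or a file-like object; here an in-memory buffer.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back gives the same columns and values.
round_trip = pd.read_csv(io.StringIO(csv_text))
print(round_trip)
```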

In [None]:
candidate_votes.to_csv('data/candidate_votes.csv',index=False)
constituency_stats.to_csv('data/constituency_stats.csv',index=False)

In [None]:
ls