<center><img src="http://i.imgur.com/sSaOozN.png" width="500"></center>

## Course: Computational Thinking for Governance Analytics

### Prof. José Manuel Magallanes, PhD 
* Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
* Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú. 

_____

# Session 2:  Data Cleaning and Formatting in Python
<a id='beginning'></a>

Having collected the data does not always allow you to produce some analytics right away. There is often a lot of pre processing to be done, especially if the data comes from scrapping. 

This session is about:

* Cleaning: making sure each cell has a value that could be used in your coming procedures. The _impurities_ do not allow formatting the data correctly, such as the presence of commas instead of points and viceversa, blanks/spaces, unneeded symbols (dollar, euro symbols), non-standard symbols to represent missing values, and so on.

* Fomatting: making sure that the clean values are in the right data type. if you are going to do text analysis, you may need to get rid of repetitive words, normalize them into lower case, and turn them back to their root or stem. For statistical work, you need to differentiate among nominal, ordinal, numerical and strings.

Let me use the next table with information on **Democracy and its components** by country. Take a look:

In [1]:
from IPython.display import IFrame  
wikiLink="https://en.wikipedia.org/wiki/Democracy_Index#Components" 
IFrame(wikiLink, width=900, height=500)

You should always see the table to start guessing what should be source of problems; and, of course, have a general understanding what the data is about. In this case you can see:

1. An _score_ of democracy is offered for the participant countries.
2. Four _levels_ of democracy are offered for the participant countries.
3. The _score_ of democracy is computed from other variables

Let's use pandas to get that table as a data frame (**DF**):

In [2]:
import pandas as pd

wikiTables=pd.read_html(wikiLink, # link
                        header=0, # where is the header?
                        flavor='bs4', # helper to translate html
                        attrs={'class': 'wikitable sortable'}) # attributes to identify element(s)

Remember the object **wikiTables** is a list. I know before hand that our DF is the third one:

In [3]:
wikiTables[2]

Unnamed: 0,Rank,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Δ Rank,Country,Regime type,Overall score,Δ Score,Elec­toral pro­cess and plura­lism,Func­tioning of govern­ment,Poli­tical partici­pation,Poli­tical cul­ture,Civil liber­ties
0,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies
1,1,,Norway,Full democracy,9.75,0.06,10.00,9.64,10.00,10.00,9.12
2,2,2,New Zealand,Full democracy,9.37,0.12,10.00,8.93,9.44,8.75,9.71
3,3,3,Finland,Full democracy,9.27,0.07,10.00,9.29,8.89,8.75,9.41
4,4,1,Sweden,Full democracy,9.26,,9.58,9.29,8.33,10.00,9.12
...,...,...,...,...,...,...,...,...,...,...,...
166,162,2,Central African Republic,Authoritarian,1.43,0.11,1.25,0.00,1.67,1.88,2.35
167,164,2,Democratic Republic of the Congo,Authoritarian,1.40,0.27,0.75,0.00,2.22,3.13,0.88
168,165,2,North Korea,Authoritarian,1.08,,0.00,2.50,1.67,1.25,0.00
169,166,31,Myanmar,Authoritarian,1.02,2.04,0.00,0.00,1.67,3.13,0.29


Pandas will show you the _head_ and the _tail_ of the DF. There you can verify if what pandas scrapped is what you expected. 

Let's create a copy of that DF:

In [4]:
DFwiki=wikiTables[2].copy()

# I. Cleaning

1. Keep columns needed

In [5]:
# columns present - add .to_list()
DFwiki.columns.to_list()

['Rank',
 '.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Δ Rank',
 'Country',
 'Regime type',
 'Overall score',
 'Δ Score',
 'Elec\xadtoral pro\xadcess and plura\xadlism',
 'Func\xadtioning of govern\xadment',
 'Poli\xadtical partici\xadpation',
 'Poli\xadtical cul\xadture',
 'Civil liber\xadties']

In [6]:
# dropping 1st, 2nd and 6th column
bye=[0,1,5]
DFwiki.drop(columns=DFwiki.columns[bye],inplace=True)#inplace!!
DFwiki

Unnamed: 0,Country,Regime type,Overall score,Elec­toral pro­cess and plura­lism,Func­tioning of govern­ment,Poli­tical partici­pation,Poli­tical cul­ture,Civil liber­ties
0,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies
1,Norway,Full democracy,9.75,10.00,9.64,10.00,10.00,9.12
2,New Zealand,Full democracy,9.37,10.00,8.93,9.44,8.75,9.71
3,Finland,Full democracy,9.27,10.00,9.29,8.89,8.75,9.41
4,Sweden,Full democracy,9.26,9.58,9.29,8.33,10.00,9.12
...,...,...,...,...,...,...,...,...
166,Central African Republic,Authoritarian,1.43,1.25,0.00,1.67,1.88,2.35
167,Democratic Republic of the Congo,Authoritarian,1.40,0.75,0.00,2.22,3.13,0.88
168,North Korea,Authoritarian,1.08,0.00,2.50,1.67,1.25,0.00
169,Myanmar,Authoritarian,1.02,0.00,0.00,1.67,3.13,0.29


2. Check column names

In [7]:
# add '.to_list()' to see beyond!
DFwiki.columns.to_list()

['Country',
 'Regime type',
 'Overall score',
 'Elec\xadtoral pro\xadcess and plura\xadlism',
 'Func\xadtioning of govern\xadment',
 'Poli\xadtical partici\xadpation',
 'Poli\xadtical cul\xadture',
 'Civil liber\xadties']

Pandas accepts 'special string patterns' or regular expressiones (aka **Regex**). You should take advantage of that. For example, **\w** means every character from A-Z (or lower case) and  from 0-9. Let me use the opposite **\W** in the function [str.replace](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.replace.html) from Pandas:

In [8]:
DFwiki.columns.str.replace(pat="\W",# the string to replace
                           repl="", # the value to replace with
                           regex=True) # is 'pat' above a regex?

Index(['Country', 'Regimetype', 'Overallscore', 'Electoralprocessandpluralism',
       'Functioningofgovernment', 'Politicalparticipation', 'Politicalculture',
       'Civilliberties'],
      dtype='object')

In [9]:
#let's check with .to_list()
DFwiki.columns.str.replace("\W","",regex=True).to_list()

['Country',
 'Regimetype',
 'Overallscore',
 'Electoralprocessandpluralism',
 'Functioningofgovernment',
 'Politicalparticipation',
 'Politicalculture',
 'Civilliberties']

Now you can alter the column names:

In [10]:
DFwiki.columns=DFwiki.columns.str.replace("\W","",regex=True).to_list()

In this stage, **\W** got rid of _spaces_ and other non standard characters (did you see the dash in the web page?).

Notice:

* Some column names may be shortened. What changes would you make? Do you like what [pandas](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.slice.html) offer?

3. Check columns with strings as cells:

In [11]:
# for example
DFwiki.iloc[:,:2]

Unnamed: 0,Country,Regimetype
0,Full democracies,Full democracies
1,Norway,Full democracy
2,New Zealand,Full democracy
3,Finland,Full democracy
4,Sweden,Full democracy
...,...,...
166,Central African Republic,Authoritarian
167,Democratic Republic of the Congo,Authoritarian
168,North Korea,Authoritarian
169,Myanmar,Authoritarian


In general, you need to check that they do not have leading nor trailing spaces:

In [12]:
# basic Python (no pandas)
" Peru ".strip()

'Peru'

The basic Python generally will be useful, but let's keep learning what Pandas offers:

In [13]:
DFwiki.Country.str.strip()

0                      Full democracies
1                                Norway
2                           New Zealand
3                               Finland
4                                Sweden
                     ...               
166            Central African Republic
167    Democratic Republic of the Congo
168                         North Korea
169                             Myanmar
170                         Afghanistan
Name: Country, Length: 171, dtype: object

However, the scope of this function is the column (the series) and it can only be applied on one column at a time. That is,  you have to use it as many  times as there are columns in your data. 

In [14]:
# this will not work:

# DFwiki.iloc[:,:2].str.strip()

Fortunately, there is a way to **apply** that function to several columns:

In [15]:
# create function for multiple use:
stripSeveral=lambda x: x.str.strip() # x is a will be a series

#apply function just created
DFwiki.iloc[:,:2].apply(stripSeveral)

Unnamed: 0,Country,Regimetype
0,Full democracies,Full democracies
1,Norway,Full democracy
2,New Zealand,Full democracy
3,Finland,Full democracy
4,Sweden,Full democracy
...,...,...
166,Central African Republic,Authoritarian
167,Democratic Republic of the Congo,Authoritarian
168,North Korea,Authoritarian
169,Myanmar,Authoritarian


You can make the changes:

In [16]:
# let's actually make the changes!
DFwiki.iloc[:,:2]=DFwiki.iloc[:,:2].apply(stripSeveral)

4. Check levels of Categorical variables

In [17]:
# this is one categorical variable
DFwiki.iloc[:,1]

0      Full democracies
1        Full democracy
2        Full democracy
3        Full democracy
4        Full democracy
             ...       
166       Authoritarian
167       Authoritarian
168       Authoritarian
169       Authoritarian
170       Authoritarian
Name: Regimetype, Length: 171, dtype: object

You should always prepare a **frequency table** to detect possible errors:

In [18]:
DFwiki.Regimetype.value_counts()

Authoritarian            59
Flawed democracy         53
Hybrid regime            34
Full democracy           21
Full democracies          1
Flawed democracies        1
Hybrid regimes            1
Authoritarian regimes     1
Name: Regimetype, dtype: int64

Notice the similar values that only have one count. Those are not levels. If you visit the webpage you will see they are labels of sections in the table. 

Let's go step by step:

In [19]:
# save the frequency table:
tableCounts=DFwiki.Regimetype.value_counts()
tableCounts

Authoritarian            59
Flawed democracy         53
Hybrid regime            34
Full democracy           21
Full democracies          1
Flawed democracies        1
Hybrid regimes            1
Authoritarian regimes     1
Name: Regimetype, dtype: int64

In [20]:
# check the ONE counts
tableCounts[tableCounts==1]

Full democracies         1
Flawed democracies       1
Hybrid regimes           1
Authoritarian regimes    1
Name: Regimetype, dtype: int64

In [21]:
# keep the names of the rows as a list (indexes as list)
badLevels=tableCounts[tableCounts==1].index.to_list()
badLevels

['Full democracies',
 'Flawed democracies',
 'Hybrid regimes',
 'Authoritarian regimes']

Let's erase the rows with those values from the data:

In [22]:
# are these wrong?
DFwiki[DFwiki.Regimetype.isin(badLevels)]

Unnamed: 0,Country,Regimetype,Overallscore,Electoralprocessandpluralism,Functioningofgovernment,Politicalparticipation,Politicalculture,Civilliberties
0,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies
22,Flawed democracies,Flawed democracies,Flawed democracies,Flawed democracies,Flawed democracies,Flawed democracies,Flawed democracies,Flawed democracies
76,Hybrid regimes,Hybrid regimes,Hybrid regimes,Hybrid regimes,Hybrid regimes,Hybrid regimes,Hybrid regimes,Hybrid regimes
111,Authoritarian regimes,Authoritarian regimes,Authoritarian regimes,Authoritarian regimes,Authoritarian regimes,Authoritarian regimes,Authoritarian regimes,Authoritarian regimes


Then, we are good without them:

In [23]:
DFwiki=DFwiki[~DFwiki.Regimetype.isin(badLevels)]
DFwiki

Unnamed: 0,Country,Regimetype,Overallscore,Electoralprocessandpluralism,Functioningofgovernment,Politicalparticipation,Politicalculture,Civilliberties
1,Norway,Full democracy,9.75,10.00,9.64,10.00,10.00,9.12
2,New Zealand,Full democracy,9.37,10.00,8.93,9.44,8.75,9.71
3,Finland,Full democracy,9.27,10.00,9.29,8.89,8.75,9.41
4,Sweden,Full democracy,9.26,9.58,9.29,8.33,10.00,9.12
5,Iceland,Full democracy,9.18,10.00,8.21,8.89,9.38,9.41
...,...,...,...,...,...,...,...,...
166,Central African Republic,Authoritarian,1.43,1.25,0.00,1.67,1.88,2.35
167,Democratic Republic of the Congo,Authoritarian,1.40,0.75,0.00,2.22,3.13,0.88
168,North Korea,Authoritarian,1.08,0.00,2.50,1.67,1.25,0.00
169,Myanmar,Authoritarian,1.02,0.00,0.00,1.67,3.13,0.29


Notice that the amount of rows reported does not correlate with the number or rows seen. This always happens when you delete rows, because the row index does not update automatically. So, you will need to reset the indexes:

In [24]:
DFwiki.reset_index()

Unnamed: 0,index,Country,Regimetype,Overallscore,Electoralprocessandpluralism,Functioningofgovernment,Politicalparticipation,Politicalculture,Civilliberties
0,1,Norway,Full democracy,9.75,10.00,9.64,10.00,10.00,9.12
1,2,New Zealand,Full democracy,9.37,10.00,8.93,9.44,8.75,9.71
2,3,Finland,Full democracy,9.27,10.00,9.29,8.89,8.75,9.41
3,4,Sweden,Full democracy,9.26,9.58,9.29,8.33,10.00,9.12
4,5,Iceland,Full democracy,9.18,10.00,8.21,8.89,9.38,9.41
...,...,...,...,...,...,...,...,...,...
162,166,Central African Republic,Authoritarian,1.43,1.25,0.00,1.67,1.88,2.35
163,167,Democratic Republic of the Congo,Authoritarian,1.40,0.75,0.00,2.22,3.13,0.88
164,168,North Korea,Authoritarian,1.08,0.00,2.50,1.67,1.25,0.00
165,169,Myanmar,Authoritarian,1.02,0.00,0.00,1.67,3.13,0.29


Resetting the index shows the previous behavior by default, that is, it keeps the old index. In general, you do not want that, so you use the [function](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html) this way:

In [25]:
DFwiki.reset_index(drop=True, inplace=True)
#
DFwiki

Unnamed: 0,Country,Regimetype,Overallscore,Electoralprocessandpluralism,Functioningofgovernment,Politicalparticipation,Politicalculture,Civilliberties
0,Norway,Full democracy,9.75,10.00,9.64,10.00,10.00,9.12
1,New Zealand,Full democracy,9.37,10.00,8.93,9.44,8.75,9.71
2,Finland,Full democracy,9.27,10.00,9.29,8.89,8.75,9.41
3,Sweden,Full democracy,9.26,9.58,9.29,8.33,10.00,9.12
4,Iceland,Full democracy,9.18,10.00,8.21,8.89,9.38,9.41
...,...,...,...,...,...,...,...,...
162,Central African Republic,Authoritarian,1.43,1.25,0.00,1.67,1.88,2.35
163,Democratic Republic of the Congo,Authoritarian,1.40,0.75,0.00,2.22,3.13,0.88
164,North Korea,Authoritarian,1.08,0.00,2.50,1.67,1.25,0.00
165,Myanmar,Authoritarian,1.02,0.00,0.00,1.67,3.13,0.29


# II. Formatting


1. Check the data types

First, see what data types have been assigned by Python to each column:

In [26]:
DFwiki.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 8 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Country                       167 non-null    object
 1   Regimetype                    167 non-null    object
 2   Overallscore                  167 non-null    object
 3   Electoralprocessandpluralism  167 non-null    object
 4   Functioningofgovernment       167 non-null    object
 5   Politicalparticipation        167 non-null    object
 6   Politicalculture              167 non-null    object
 7   Civilliberties                167 non-null    object
dtypes: object(8)
memory usage: 10.6+ KB


You can pay attention to text case here, mainly for the column names:

* [Lower case](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.lower.html):

In [27]:
DFwiki.columns.str.lower()

Index(['country', 'regimetype', 'overallscore', 'electoralprocessandpluralism',
       'functioningofgovernment', 'politicalparticipation', 'politicalculture',
       'civilliberties'],
      dtype='object')

* Upper case

In [28]:
DFwiki.columns.str.upper()

Index(['COUNTRY', 'REGIMETYPE', 'OVERALLSCORE', 'ELECTORALPROCESSANDPLURALISM',
       'FUNCTIONINGOFGOVERNMENT', 'POLITICALPARTICIPATION', 'POLITICALCULTURE',
       'CIVILLIBERTIES'],
      dtype='object')

You also have options like [str.capitalize](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.capitalize.html) and [str.title](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.title.html).

Notice  from the **info()** that all your data types (_Dtypes_) are of the **object** kind. If the columns have texts, Python will say it is an **object**. That is OK for _Country_, but not for the others. _Regimetype_ is a **category**, and all the other columns are **numeric** values. Let's work on that:

2. Text to categorical



In [29]:
# this column is clean...
DFwiki.Regimetype.value_counts()

Authoritarian       59
Flawed democracy    53
Hybrid regime       34
Full democracy      21
Name: Regimetype, dtype: int64

In [30]:
# but NOT formatted:
DFwiki.Regimetype.dtype

dtype('O')

In the cleaning process we got rid of the wrong levels, now we need to set the right data type:

In [31]:
# import function for the categories to be set:
from pandas.api.types import CategoricalDtype

# prepare list order of levels (ascending when ordinal)
regimeLevels=["Authoritarian", "Hybrid regime","Flawed democracy", "Full democracy"]

# create custom data type
regimeOrdered = CategoricalDtype(categories=regimeLevels, ordered=True)

# set the Dtype of the column (one column):
DFwiki['Regimetype']=DFwiki.Regimetype.astype(regimeOrdered)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DFwiki['Regimetype']=DFwiki.Regimetype.astype(regimeOrdered)


See the changes:

In [32]:
DFwiki.Regimetype.dtype

CategoricalDtype(categories=['Authoritarian', 'Hybrid regime', 'Flawed democracy',
                  'Full democracy'],
, ordered=True)

3. Numbers that need to be numerical type

This is simple with [to_numeric](https://pandas.pydata.org/docs/reference/api/pandas.to_numeric.html) from pandas. See how it works:


In [33]:
#current dtype:
DFwiki.Overallscore.dtype

dtype('O')

In [34]:
#formatting ONE column:
pd.to_numeric(DFwiki.Overallscore,errors='coerce')

0      9.75
1      9.37
2      9.27
3      9.26
4      9.18
       ... 
162    1.43
163    1.40
164    1.08
165    1.02
166    0.32
Name: Overallscore, Length: 167, dtype: float64

However, that functions can not be applied to a set columns, but one column at a time.

In [35]:
# this will not work:
# pd.to_numeric(DFwiki.iloc[:,2:])

I showed you before how to _apply_ a function to several columns. Let me use that again:

Let's follow the last strategy:

In [36]:
# customize function
severalToNum=lambda x:pd.to_numeric(x,errors='coerce')
#pd.to_numeric,
#apply function:

where=DFwiki.columns[2:]

DFwiki.loc[:,where]=DFwiki.loc[:,where].apply(severalToNum)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  DFwiki.loc[:,where]=DFwiki.loc[:,where].apply(severalToNum)
  DFwiki.loc[:,where]=DFwiki.loc[:,where].apply(severalToNum)


In [37]:
# result

DFwiki.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 8 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   Country                       167 non-null    object  
 1   Regimetype                    167 non-null    category
 2   Overallscore                  167 non-null    float64 
 3   Electoralprocessandpluralism  167 non-null    float64 
 4   Functioningofgovernment       167 non-null    float64 
 5   Politicalparticipation        167 non-null    float64 
 6   Politicalculture              167 non-null    float64 
 7   Civilliberties                167 non-null    float64 
dtypes: category(1), float64(6), object(1)
memory usage: 9.6+ KB


The DF now has the right data types. 

At this stage, it would be a good idea to save this work locally in a file:

In [38]:
# two options
DFwiki.to_csv("demoindex.csv",index=False)
DFwiki.to_pickle("demoindex.pkl")

CSV files are very common, but let me show a disadvantage:

In [39]:
democsv=pd.read_csv('demoindex.csv')
# see Dtypes:
democsv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 8 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Country                       167 non-null    object 
 1   Regimetype                    167 non-null    object 
 2   Overallscore                  167 non-null    float64
 3   Electoralprocessandpluralism  167 non-null    float64
 4   Functioningofgovernment       167 non-null    float64
 5   Politicalparticipation        167 non-null    float64
 6   Politicalculture              167 non-null    float64
 7   Civilliberties                167 non-null    float64
dtypes: float64(6), object(2)
memory usage: 10.6+ KB


Compare it to the pickle version:

In [40]:
demopkl=pd.read_pickle('demoindex.pkl')
#
demopkl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 8 columns):
 #   Column                        Non-Null Count  Dtype   
---  ------                        --------------  -----   
 0   Country                       167 non-null    object  
 1   Regimetype                    167 non-null    category
 2   Overallscore                  167 non-null    float64 
 3   Electoralprocessandpluralism  167 non-null    float64 
 4   Functioningofgovernment       167 non-null    float64 
 5   Politicalparticipation        167 non-null    float64 
 6   Politicalculture              167 non-null    float64 
 7   Civilliberties                167 non-null    float64 
dtypes: category(1), float64(6), object(1)
memory usage: 9.5+ KB
