<center><img src="http://i.imgur.com/sSaOozN.png" width="500"></center>

## Course: Computational Thinking for Governance Analytics

### Prof. José Manuel Magallanes, PhD 
* Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
* Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú. 

_____

# Data Preprocessing in Python: Data Integration and Reshaping

We often collect data from different places. While cleaning and formatting are done separately for each data source, we eventually need to integrate all the sources into one to start the real analytical work.

I will use several data sets in this material. Let me start with the Democracy Index table from Wikipedia:

In [1]:
import pandas as pd

demoLink="https://en.wikipedia.org/wiki/Democracy_Index"

demodata=pd.read_html(demoLink,header=0,flavor="bs4",attrs={'class':"wikitable sortable"})

You should remember by now that **demodata** is a _list_ of all the sortable tables from that URL. Let me recover the one we saw last week:

In [2]:
demoVars=demodata[4].copy()

Let's see some info:

In [3]:
demoVars.head()

Unnamed: 0,Rank,.mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Δ Rank,Country,Regime type,Overall score,Δ Score,Elec­toral pro­cess and plura­lism,Func­tioning of govern­ment,Poli­tical partici­pation,Poli­tical cul­ture,Civil liber­ties
0,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies
1,1,,Norway,Full democracy,9.81,0.06,10.00,9.64,10.00,10.00,9.41
2,2,,Iceland,Full democracy,9.37,0.21,10.00,8.57,8.89,10.00,9.41
3,3,,Sweden,Full democracy,9.26,0.13,9.58,9.29,8.33,10.00,9.12
4,4,,New Zealand,Full democracy,9.25,0.01,10.00,8.93,8.89,8.75,9.71


Let's keep the columns we will  use:

In [4]:
whichToDrop=[0,1,5]
demoVars.drop(labels=demoVars.columns[whichToDrop],axis=1,inplace=True)

In [5]:
demoVars

Unnamed: 0,Country,Regime type,Overall score,Elec­toral pro­cess and plura­lism,Func­tioning of govern­ment,Poli­tical partici­pation,Poli­tical cul­ture,Civil liber­ties
0,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies,Full democracies
1,Norway,Full democracy,9.81,10.00,9.64,10.00,10.00,9.41
2,Iceland,Full democracy,9.37,10.00,8.57,8.89,10.00,9.41
3,Sweden,Full democracy,9.26,9.58,9.29,8.33,10.00,9.12
4,New Zealand,Full democracy,9.25,10.00,8.93,8.89,8.75,9.71
...,...,...,...,...,...,...,...,...
166,Chad,Authoritarian,1.55,0.00,0.00,1.67,3.75,2.35
167,Syria,Authoritarian,1.43,0.00,0.00,2.78,4.38,0.00
168,Central African Republic,Authoritarian,1.32,1.25,0.00,1.11,1.88,2.35
169,Democratic Republic of the Congo,Authoritarian,1.13,0.00,0.00,1.67,3.13,0.88


Let's clean the columns names:

In [6]:
# these are:
demoVars.columns

Index(['Country', 'Regime type', 'Overall score',
       'Elec­toral pro­cess and plura­lism', 'Func­tioning of govern­ment',
       'Poli­tical partici­pation', 'Poli­tical cul­ture', 'Civil liber­ties'],
      dtype='object')

Let's inspect one of the names that shows hyphens on the website:

In [7]:
import re
re.sub("\\s","",demoVars.columns[5])

'Poli\xadticalpartici\xadpation'

The result reveals a hidden character (`\xad`, a soft hyphen). Let's use that info with the pandas replace:

In [8]:
demoVars.columns=demoVars.columns.str.replace("\\s|\\xad","",regex=True)
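
To see the cleaning step in isolation, here is a minimal sketch on a made-up data frame. The name `toy` and its two column names are assumptions for illustration; only the regular expression mirrors the one used above.

```python
import pandas as pd

# Hypothetical column names containing spaces and soft hyphens (\xad)
toy = pd.DataFrame(columns=["Poli\xadtical cul\xadture", "Civil liber\xadties"])

# Remove every whitespace character or soft hyphen from the names
toy.columns = toy.columns.str.replace(r"\s|\xad", "", regex=True)

print(list(toy.columns))
```

The soft hyphen is not a whitespace character, which is why `\s` alone was not enough in the previous cell.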

Let's clean the data contents. Notice that the table on the website includes some group labels that are not needed here. Let's check the frequency table of "Regime type" to identify the wrong labels that are affecting the data frame: 

In [9]:
demoVars['Regimetype'].value_counts()

Authoritarian            57
Flawed democracy         52
Hybrid regime            35
Full democracy           23
Full democracies          1
Flawed democracies        1
Hybrid regimes            1
Authoritarian regimes     1
Name: Regimetype, dtype: int64

In [10]:
# these are the wrong ones:
demoVars['Regimetype'].value_counts().index[4:]

Index(['Full democracies', 'Flawed democracies', 'Hybrid regimes',
       'Authoritarian regimes'],
      dtype='object')

In [11]:
#saving the wrong ones:
byeValues=demoVars['Regimetype'].value_counts().index[4:]

Now that we know which ones are not needed, we can filter the data frame rows using pandas' **isin**:

In [12]:
demoVars = demoVars[~demoVars['Regimetype'].isin(byeValues)]
demoVars

Unnamed: 0,Country,Regimetype,Overallscore,Electoralprocessandpluralism,Functioningofgovernment,Politicalparticipation,Politicalculture,Civilliberties
1,Norway,Full democracy,9.81,10.00,9.64,10.00,10.00,9.41
2,Iceland,Full democracy,9.37,10.00,8.57,8.89,10.00,9.41
3,Sweden,Full democracy,9.26,9.58,9.29,8.33,10.00,9.12
4,New Zealand,Full democracy,9.25,10.00,8.93,8.89,8.75,9.71
5,Canada,Full democracy,9.24,9.58,8.93,8.89,9.38,9.41
...,...,...,...,...,...,...,...,...
166,Chad,Authoritarian,1.55,0.00,0.00,1.67,3.75,2.35
167,Syria,Authoritarian,1.43,0.00,0.00,2.78,4.38,0.00
168,Central African Republic,Authoritarian,1.32,1.25,0.00,1.11,1.88,2.35
169,Democratic Republic of the Congo,Authoritarian,1.13,0.00,0.00,1.67,3.13,0.88


Notice that the index **0** has disappeared, and that even though you have 167 countries now, the last one has index **170**. That will happen whenever you filter rows, so it is better to reset the indexes of the data frame:

In [13]:
demoVars.reset_index(drop=True,inplace=True)
demoVars

Unnamed: 0,Country,Regimetype,Overallscore,Electoralprocessandpluralism,Functioningofgovernment,Politicalparticipation,Politicalculture,Civilliberties
0,Norway,Full democracy,9.81,10.00,9.64,10.00,10.00,9.41
1,Iceland,Full democracy,9.37,10.00,8.57,8.89,10.00,9.41
2,Sweden,Full democracy,9.26,9.58,9.29,8.33,10.00,9.12
3,New Zealand,Full democracy,9.25,10.00,8.93,8.89,8.75,9.71
4,Canada,Full democracy,9.24,9.58,8.93,8.89,9.38,9.41
...,...,...,...,...,...,...,...,...
162,Chad,Authoritarian,1.55,0.00,0.00,1.67,3.75,2.35
163,Syria,Authoritarian,1.43,0.00,0.00,2.78,4.38,0.00
164,Central African Republic,Authoritarian,1.32,1.25,0.00,1.11,1.88,2.35
165,Democratic Republic of the Congo,Authoritarian,1.13,0.00,0.00,1.67,3.13,0.88
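
The filter-then-reset pattern can be sketched on a tiny made-up data frame. The names `toy`, `bye`, `kept` and `clean`, and the country labels, are hypothetical.

```python
import pandas as pd

# Made-up rows: two of them are group labels, not countries
toy = pd.DataFrame({
    "Country": ["Full democracies", "Country A", "Country B",
                "Hybrid regimes", "Country C"],
    "Regimetype": ["Full democracies", "Full democracy", "Full democracy",
                   "Hybrid regimes", "Hybrid regime"],
})

bye = ["Full democracies", "Hybrid regimes"]

# Keep the rows whose regime type is NOT in the list of wrong labels
kept = toy[~toy["Regimetype"].isin(bye)]
print(kept.index.tolist())   # the index keeps the gaps left by the filter

# reset_index(drop=True) renumbers the rows and discards the old index
clean = kept.reset_index(drop=True)
print(clean.index.tolist())
```

Here I return a new data frame instead of using `inplace=True`; both styles lead to the same result.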


Let's see the data types:

In [14]:
demoVars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 167 entries, 0 to 166
Data columns (total 8 columns):
 #   Column                        Non-Null Count  Dtype 
---  ------                        --------------  ----- 
 0   Country                       167 non-null    object
 1   Regimetype                    167 non-null    object
 2   Overallscore                  167 non-null    object
 3   Electoralprocessandpluralism  167 non-null    object
 4   Functioningofgovernment       167 non-null    object
 5   Politicalparticipation        167 non-null    object
 6   Politicalculture              167 non-null    object
 7   Civilliberties                167 non-null    object
dtypes: object(8)
memory usage: 10.6+ KB


The data types are wrong: every column is stored as text (**object**). Since pandas is not recognizing the numbers, it will not offer the usual numeric statistics:

In [15]:
demoVars.describe()

Unnamed: 0,Country,Regimetype,Overallscore,Electoralprocessandpluralism,Functioningofgovernment,Politicalparticipation,Politicalculture,Civilliberties
count,167,167,167.0,167.0,167.0,167.0,167.0,167.0
unique,167,4,155.0,48.0,44.0,18.0,15.0,33.0
top,Norway,Authoritarian,4.86,9.58,5.36,6.67,5.63,8.53
freq,1,57,2.0,26.0,16.0,23.0,31.0,11.0


In [16]:
toBeNumeric_Index=demoVars.columns[2:]
# work on an explicit copy so the assignment does not raise SettingWithCopyWarning
demoVars=demoVars.copy()
# assigning by column name replaces the text columns with numeric ones
demoVars[toBeNumeric_Index]=demoVars[toBeNumeric_Index].apply(pd.to_numeric, errors='raise')


In [None]:
demoVars.describe(include='all')
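
If a conversion with `errors='raise'` stops with an error, one possible way to locate the offending cells is to use `errors='coerce'` first. This is only a sketch on made-up values; `toy`, `converted` and `problems` are hypothetical names.

```python
import pandas as pd

# Made-up scores stored as text; one of them is not a number
toy = pd.DataFrame({"Country": ["A", "B", "C"],
                    "score": ["9.81", "7.5", "n/a"]})

# errors='coerce' turns unparseable text into NaN instead of stopping
converted = pd.to_numeric(toy["score"], errors="coerce")

# Rows that had a value before the conversion but became NaN after it
problems = toy[converted.isna() & toy["score"].notna()]
print(problems)

# Once the problems are understood, keep the numeric version
toy["score"] = converted
print(toy.dtypes)
```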

## Merging

Integrating data sets needs the following considerations:

* Merging is done on two data frames (you can prepare a function to merge more).
* You need a common column to be used in both data frames. The column names can be different.
* The merge can keep only the rows that match in both data frames, or also the unmatched ones, which helps you detect where extra cleaning is needed.
* Pandas differentiates the **left** from the **right** data frames.
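
The first consideration mentions a function to merge more than two data frames. One possible sketch uses `functools.reduce` to chain pairwise merges; the function name `merge_all` and the three small data frames are assumptions made for illustration.

```python
from functools import reduce
import pandas as pd

def merge_all(frames, on, how="inner"):
    """Merge a list of data frames pairwise, left to right, on a shared key."""
    return reduce(lambda left, right: left.merge(right, on=on, how=how), frames)

# Made-up data frames sharing the key column 'Country'
a = pd.DataFrame({"Country": ["X", "Y", "Z"], "users": [10, 20, 30]})
b = pd.DataFrame({"Country": ["X", "Y"], "pop": [100, 200]})
c = pd.DataFrame({"Country": ["Y", "X"], "score": [5.5, 7.0]})

merged = merge_all([a, b, c], on="Country")
print(merged)
```

This sketch assumes the key column has the same name in every data frame; otherwise you would rename it first.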

Since I want to divide the number of internet users by the population, I need to **merge** both data frames. Let me show you several possibilities:

* **Option one**: merge only the coincidences:

In [None]:

ciainter.merge(ciapop)

The previous merge just got rid of any row that could not find the same country name in both data frames.

* **Option two**: merge when the column keys are different:

In [None]:
#let me rename the key column in 'ciapop':
ciapop.rename(columns={'Country':'countries'},inplace=True)

In [None]:
# this will give you an error:
ciainter.merge(ciapop)

In [None]:
# this is the right code:
ciainter.merge(ciapop,left_on='Country',right_on='countries')

You got the same result (with an extra column).

* **Option three**: you want to keep all the rows in the **left** data frame:

In [None]:
ciainter.merge(ciapop,left_on='Country',right_on='countries',how='left') 

* **Option four**: you want to keep all the rows in the **right** data frame:

In [None]:
ciainter.merge(ciapop,left_on='Country',right_on='countries',how='right') 

* **Option five**: you want to keep all the rows from **both** data frames:

In [None]:
ciainter.merge(ciapop,left_on='Country',right_on='countries',how='outer',indicator=True)

Notice that I included the argument **indicator=True**, which added a column (**_merge**) telling if the row comes from both data frames, or only from the left or the right one.
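
The options above can be compared on two tiny made-up data frames. The names `left` and `right` and all the values are hypothetical; only the column names follow the ones used in this section.

```python
import pandas as pd

left = pd.DataFrame({"Country": ["X", "Y", "Z"], "intusers": [10, 20, 30]})
right = pd.DataFrame({"countries": ["X", "Y", "W"], "pob": [100, 200, 400]})

# Default (inner) merge: only the keys present in both data frames survive
inner = left.merge(right, left_on="Country", right_on="countries")

# Outer merge: every row from both sides, plus the '_merge' indicator column
outer = left.merge(right, left_on="Country", right_on="countries",
                   how="outer", indicator=True)

print(len(inner), len(outer))
print(outer["_merge"].value_counts())
```

Rows flagged as `left_only` or `right_only` are the ones worth inspecting for spelling differences in the key.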

### Looking for improvements after merging

Let me pay attention to this result again:

In [None]:
allRight=ciainter.merge(ciapop,left_on='Country',right_on='countries',how='left', indicator=True) 
allRight

The previous result is different from this one:

In [None]:
ciainter.merge(ciapop,left_on='Country',right_on='countries')

There is a **one**-row difference; let me see:

In [None]:
allRight[allRight._merge!='both']

I have found the only country that is not present in 'ciapop'. If, for instance, that country appeared as **The Antarctica** in *ciapop*, you could replace it like this:

In [None]:
###dictionary of replacements:
#replacementscia={'The Antarctica':'Antarctica'}

### replacing
#ciapop['countries']=ciapop['countries'].replace(replacementscia)

...and you will need to redo the merge.
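
A minimal sketch of that replace-and-merge-again cycle, using made-up values (the numbers and the names `left`, `right`, `check` and `fixed` are assumptions):

```python
import pandas as pd

left = pd.DataFrame({"Country": ["Antarctica", "Peru"], "intusers": [1, 2]})
right = pd.DataFrame({"countries": ["The Antarctica", "Peru"], "pob": [5, 6]})

# First attempt: find the keys from the left side that found no partner
check = left.merge(right, left_on="Country", right_on="countries",
                   how="left", indicator=True)
unmatched = check.loc[check["_merge"] != "both", "Country"].tolist()
print(unmatched)

# Fix the spelling in the right data frame, then redo the merge
right["countries"] = right["countries"].replace({"The Antarctica": "Antarctica"})
fixed = left.merge(right, left_on="Country", right_on="countries",
                   how="left", indicator=True)
print(fixed["_merge"].tolist())
```

I assign the result of `replace` back to the column because, in recent pandas versions, an `inplace=True` call on a selected column may not modify the data frame.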

Let me keep the **allRight** dataframe, erasing the irrelevant columns and rows:

In [None]:
# dropping columns
byeCols=['countries','_merge']
allRight.drop(columns=byeCols,inplace=True)

In [None]:
# dropping rows
byeRows=[217]
allRight.drop(index=byeRows,inplace=True)

When you erase rows, remember to reset the index, as we did for **demoVars**.

____
____


### <font color="red">Saving File to Disk</font>

#### For future use in Python:

In [None]:
allRight.to_pickle("allRight.pkl")
# you will need: DF=pd.read_pickle("allRight.pkl")
# or:
# from urllib.request import urlopen
# DF=pd.read_pickle(urlopen("https://..../allRight.pkl"),compression=None)

#### For future  use in R:

In [None]:
from rpy2.robjects import pandas2ri
pandas2ri.activate()

from rpy2.robjects.packages import importr

base = importr('base')
base.saveRDS(allRight,file="allRight.RDS")

#In R, you call it with: DF = readRDS("allRight.RDS")
#or, if you read from the cloud: DF = readRDS(url("https://..../allRight.RDS"))

## Reshaping

### Wide and Long format

The current format of **allRight** is known as the **WIDE** format. In this format, each variable has its own column; it is the most familiar layout for spreadsheet users. Several functions are ready to use this format, for example:

In [None]:
# A scatter plot
allRight.plot.scatter(x='intusers', y='pob',grid=True)

In [None]:
# a boxplot
allRight.loc[:,['intusers','pob']].boxplot(vert=False,figsize=(15,5),grid=False)

However, the wide format may be less efficient for some packages:

In [None]:
#!pip install plotnine

In [None]:
import plotnine as p9

base=p9.ggplot(data=allRight)
base + p9.geom_boxplot(p9.aes(x=1,y='intusers')) + p9.geom_boxplot(p9.aes(x=2,y='pob'))

Let's see the **LONG** format:

In [None]:
allRight.melt(id_vars=['Country'])

The number of rows multiplies, but **all** the variables in the wide format now use only **TWO** columns in the long format (in its basic form): one with the variable names and one with their values. Notice the difference in this code:

In [None]:
allRightLONG=allRight.melt(id_vars=['Country'])
base=p9.ggplot(data=allRightLONG)
base + p9.geom_boxplot(p9.aes(x='variable',y='value'))
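
To see what `melt` does without plotting, here is a small sketch with made-up numbers (`wide`, `long` and `back` are hypothetical names). It also shows one way to return to the wide format with `pivot`.

```python
import pandas as pd

wide = pd.DataFrame({"Country": ["X", "Y"],
                     "intusers": [10, 20],
                     "pob": [100, 200]})

# Wide to long: one row per (Country, variable) pair
long = wide.melt(id_vars=["Country"])
print(long)

# Long back to wide: one column per value found in 'variable'
back = long.pivot(index="Country", columns="variable", values="value")
print(back)
```

Two countries with two variables become four rows in the long version, and the columns are named `variable` and `value` by default.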

### Transposing

We have two data sets on information about race, one for California and one for Washington State. These are the links:

In [None]:
# California link
linkCa='https://github.com/EvansDataScience/data/raw/master/CaliforniaRace.xlsx'

# Washington link
linkWa='https://github.com/EvansDataScience/data/raw/master/WAraceinfo.xlsx'

You can tell from the links that both files are in Excel format ( _xlsx_ ). Let's fetch them:

In [None]:
raceca=pd.read_excel(linkCa,0) # first sheet
racewa=pd.read_excel(linkWa,1) # second sheet

Let me see what **racewa** has:

In [None]:
racewa

The rows give you information on geographical units (the **unit of analysis** is the county). It apparently starts with information on the whole state (Washington), and then goes county by county. Notice that the units of analysis repeat by age group and by year.

Now, let's see what **raceca** has:

In [None]:
raceca

Notice that the data from California covers the same topic, but the **units of analysis** (counties) appear in the columns. Notice also that while WA State only shows counts, CA State also shows percentages. 

The data from WA State is a standard format for data frames, while the one in CA State is not. However, a simple operation known as **transposing** will solve the situation:

In [None]:
raceca.transpose()

In [None]:
# Let's make the changes:
raceca=raceca.transpose()

The transposed data frame requires several cleaning steps:

* Use the first row as column names:

In [None]:
# first row, where the columns names are.
raceca.columns=raceca.iloc[0,:].to_list()

* Delete first row:

In [None]:
raceca.head()

In [None]:
# dropping first row, effective immediately
raceca.drop(index='Unnamed: 0',inplace=True)

* Keep the columns about **race**:

In [None]:
# finding positions:
list(enumerate(raceca.columns))

In [None]:
# values needed:
[0]+ list(range(23,31))

In [None]:
# keeping the ones I want:
raceca=raceca.iloc[:,[0]+ list(range(23,31))]
raceca

* Drop rows with missing values:

In [None]:
raceca.dropna(subset=['Statistics'],inplace=True)

When we drop rows, we reset indexes:

In [None]:
raceca.reset_index(drop=True,inplace=True)

In [None]:
# currently
raceca

This is a much  simpler data frame. 
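
The first cleaning steps after transposing can be sketched on a tiny made-up table whose layout imitates the California file (variables in rows, counties in columns). The names `raw` and `t`, the county labels and the numbers are all assumptions.

```python
import pandas as pd

# Made-up layout: first column holds the variable names, the rest are counties
raw = pd.DataFrame({
    "Unnamed: 0": ["White", "Asian"],
    "County A": [10, 3],
    "County B": [20, 7],
})

t = raw.transpose()

# The first row of the transposed frame holds the future column names
t.columns = t.iloc[0, :].to_list()

# Drop that first row, then turn the county labels into a regular column
t = t.drop(index="Unnamed: 0")
t = t.rename_axis("county").reset_index()

# Transposing mixed columns leaves 'object' values, so recover the numbers
t = t.astype({"White": int, "Asian": int})
print(t)
```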

### Aggregating

The data from WA State has data from different years, while the one from CA is just from 2019. Let's keep that year for WA:

In [None]:
racewa.query('Year==2019',inplace=True)

Now you have:

In [None]:
racewa

Notice that the data is organized by age in WA:

In [None]:
racewa['Age Group'].to_list()

There is a **Total** in **Age Group** that I will not use (that  makes this work simpler).

In [None]:
racewa=racewa[racewa['Age Group']!='Total']
racewa

The ages are organized in intervals; let's keep the consecutive ones that, together, cover all ages:

In [None]:
stay=['0-19', '20-64', '65+']

racewa=racewa[racewa['Age Group'].isin(stay)]
racewa

We should also keep only the rows that do not refer to the whole state ('__Washington__'):

In [None]:
racewa=racewa[racewa["Area Name"]!='Washington']
racewa

**Aggregation** is used when you need to collapse rows. You can use different functions for collapsing; in this case we will **sum** within each county, so I can get a total per county:

In [None]:
racewa=racewa.groupby(['Area Name','Area ID','Year']).sum()
racewa

**Age Group** was not one of the grouping variables in **groupby()**, and it is the only **non-numeric** column left out of the grouping. Notice that the **Age Group** values have been concatenated (summing text joins it), and that the grouping variables are now the **indexes** (row names).

You can drop the age group now:

In [None]:
racewa.drop(columns=['Age Group'],inplace=True)
racewa
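
As an alternative sketch, `sum(numeric_only=True)` skips the text columns during the aggregation, so there is nothing to drop afterwards. The data frame `toy` and its values are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Area Name": ["County A", "County A", "County B", "County B"],
    "Age Group": ["0-19", "20-64", "0-19", "20-64"],
    "Total": [5, 7, 2, 3],
})

# Sum within each county, ignoring non-numeric columns such as 'Age Group'
by_county = toy.groupby("Area Name").sum(numeric_only=True)
print(by_county)

# as_index=False keeps the grouping variable as a regular column
flat = toy.groupby("Area Name", as_index=False)["Total"].sum()
print(flat)
```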

### Appending

The units of analysis in both data frames are of the same kind (counties), but the data from one data frame will not become a column of the other. In this situation, you do not **merge**, you **append**.

The condition for appending is that both data frames have the same column names. Let's see:

In [None]:
raceca.columns

In [None]:
racewa.columns

The column 'Some Other Race Alone' in **raceca** has no similar value in **racewa**. Let's drop it:

In [None]:
raceca.drop(columns=['Some Other Race Alone'],inplace=True)

The columns in WA State have values for male and female. Since CA State does not have those, we have to get rid of them:

In [None]:
# good names
[name for name in racewa.columns if 'Total' in name]

In [None]:
# then
racewa=racewa.loc[:,[name for name in racewa.columns if 'Total' in name]]
racewa

We need to recover the county names in racewa. They are part of the indexes. Let me use the **reset_index** function, but using the argument **drop=False**:

In [None]:
racewa.reset_index(drop=False,inplace=True)

In [None]:
# you have
racewa

The columns "Area ID" and "Year" are not present in raceca, we should drop them:

In [None]:
racewa.drop(columns=["Area ID", "Year"],inplace=True)
racewa

Let's see the names of both:

In [None]:
dict(zip(raceca.columns,racewa.columns))

You can use that dictionary to alter the names in raceca:
    

In [None]:
raceca.rename(columns=dict(zip(raceca.columns,racewa.columns)),inplace=True)
raceca

The **Area Name** column in **raceca** includes the word "California"; since you are combining data from different states, it is better to keep that info:

In [None]:
# using 'expand'
raceca['Area Name'].str.split(pat=", ",expand=True)

The function **str.split** creates two columns, let me save them here:

In [None]:
twoCols=raceca['Area Name'].str.split(pat=", ",expand=True)
twoCols

Use the new columns to overwrite **Area Name** and to create **State**:

In [None]:
raceca['Area Name']=twoCols[0]
raceca['State']=twoCols[1]

In [None]:
# we have
raceca
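
The splitting step can be tried on made-up labels that follow the same "county, state" pattern. The names `toy` and `parts` and the county labels are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({"Area Name": ["Alpha County, California",
                                  "Beta County, California"]})

# expand=True returns a data frame with one column per piece
parts = toy["Area Name"].str.split(pat=", ", expand=True)

# First piece: the county (without the ' County' suffix); second piece: the state
toy["Area Name"] = parts[0].str.replace(" County", "", regex=False)
toy["State"] = parts[1]
print(toy)
```

Passing `regex=False` makes explicit that " County" is plain text and not a pattern.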

Let's drop the last row (the 'TOTAL'):

In [None]:
raceca.drop(index=[41],inplace=True)

You can get rid of the 'County' string:

In [None]:
raceca['Area Name']=raceca['Area Name'].str.replace(" County","")
raceca

The data frame  **racewa** does not have a "State" column, let me create it:

In [None]:
racewa['State']='Washington'

Let's check that the column names match:

In [None]:
racewa.columns==raceca.columns

Now we can append:

In [None]:
# DataFrame.append() was removed in pandas 2.0; pd.concat() stacks the rows instead
racewaca=pd.concat([racewa,raceca], ignore_index=True)
racewaca
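
Stacking two data frames with matching columns can be sketched with `pd.concat`, which is the supported way to append in recent pandas versions. The data frames `wa` and `ca` and their values are made up.

```python
import pandas as pd

wa = pd.DataFrame({"Area Name": ["A"], "Total": [1], "State": ["Washington"]})
ca = pd.DataFrame({"Area Name": ["B", "C"], "Total": [2, 3],
                   "State": ["California", "California"]})

# Comparing lists avoids an error when the number of columns differs
same_columns = list(wa.columns) == list(ca.columns)
print(same_columns)

# ignore_index=True renumbers the rows of the stacked result
both = pd.concat([wa, ca], ignore_index=True)
print(both)
```

If the column names did not match, `pd.concat` would still run, but it would create extra columns filled with missing values, so the check is worth doing first.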

Let's check the data types:

In [None]:
#checking
racewaca.info()

There are several columns that are numeric, but they have the wrong Dtype. Let's solve this:

In [None]:
# these columns require formatting
racewaca[racewaca.columns[1:-1]]

In [None]:
# let's do it:

racewaca[racewaca.columns[1:-1]]=racewaca[racewaca.columns[1:-1]].astype('float')

In [None]:
# checking
racewaca.info()