<center><img src="http://i.imgur.com/sSaOozN.png" width="500"></center>

## Course: Computational Thinking for Governance Analytics

### Prof. José Manuel Magallanes, PhD 
* Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
* Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú. 

_____

# Session 2:  Data Preprocessing in Python

<a id='beginning'></a>

Preprocessing includes three stages:

1. [Collecting](#part1) 
2. [Cleaning](#part2) 
3. [Formatting](#part3) 

____
<a id='part1'></a>

## Collecting

Collecting is not a preprocessing problem when the data are already nicely organized. In that case, **pandas** will read the file without problems:

* Reading STATA file

In [3]:
import pandas as pd

stataFile='https://github.com/EvansDataScience/data/raw/master/lapopUSA2017_13.dta'
##
dataStata=pd.read_stata(stataFile,convert_categoricals=False)

* Reading EXCEL file

In [4]:
excelFile='https://github.com/EvansDataScience/data/raw/master/hdi2016.xlsx'
dataExcel=pd.read_excel(excelFile)

As you have just seen, pandas reads common file types easily. Once you have the data, you can ask many things about them, for example:

* Number of rows and columns:

In [6]:
dataStata.shape

(1500, 119)

* See the first rows:

In [5]:
dataExcel.head()

Unnamed: 0,Country,Human Development Index (HDI),Life expectancy at birth,Expected years of schooling,Mean years of schooling,Gross national income (GNI) per capita,Category
0,Norway,0.949423,81.711,17.67187,12.74642,67614.35348,VERY HIGH HUMAN DEVELOPMENT
1,Australia,0.93868,82.537,20.43272,13.1751,42822.19627,VERY HIGH HUMAN DEVELOPMENT
2,Switzerland,0.939131,83.133,16.04041,13.37,56363.9578,VERY HIGH HUMAN DEVELOPMENT
3,Germany,0.925669,81.092,17.09594,13.187626,44999.64714,VERY HIGH HUMAN DEVELOPMENT
4,Denmark,0.924649,80.412,19.1888,12.70017,44518.92402,VERY HIGH HUMAN DEVELOPMENT


__shape__ is an attribute of the data frame, so it needs no parentheses; parentheses are needed to call a function (a method), as in __head()__. You also have __tail()__:

In [7]:
dataStata[['q12','q12m','q12f']].tail()

Unnamed: 0,q12,q12m,q12f
1495,2.0,1.0,1.0
1496,0.0,,
1497,0.0,,
1498,0.0,,
1499,1.0,1.0,0.0


The tail is showing only the last elements of the selected columns.
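
To make the difference between an attribute and a method concrete, here is a minimal sketch on a small made-up data frame (the name `toy` and its values are illustrative assumptions, not part of the survey data):

```python
import pandas as pd

# hypothetical toy data, only for illustration
toy = pd.DataFrame({'q12': [2.0, 0.0, 0.0, 1.0],
                    'q12m': [1.0, None, None, 1.0]})

dims = toy.shape          # attribute: no parentheses
firstTwo = toy.head(2)    # method: parentheses, optional number of rows
lastTwo = toy.tail(2)     # method: last rows, original index labels are kept

print(dims)
print(lastTwo.index.tolist())
```

Notice that `tail()` keeps the original row labels, which is why the output above starts at 1495.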

* Self-collecting

Another way to collect data is to create an online form and then get the data into Python. For this situation, go to this [link](https://goo.gl/forms/HX3KkxcEtXgMzyDJ3) and answer the questions.

Then, I will see the answers:


In [8]:
link='https://docs.google.com/spreadsheets/d/e/2PACX-1vT9zoGqRmUztGnDgjD4zWTQlttDyO5uRTotZQNQM_aUMrwhfOTf_0mpASz3Cra65VYF6DeCVxCm77Mj/pub?gid=1005402639&single=true&output=csv'
namesOfCols=['timeStamp','name','sex','age','bornIn','workingStatus']

myData = pd.read_csv(link,header=0,names=namesOfCols) #header = 0 means header is the first line

# here it is:
myData.head(10)

Unnamed: 0,timeStamp,name,sex,age,bornIn,workingStatus
0,1/30/2019 9:35:58,Nat,Man,29.0,CA,Intern with salary
1,2/4/2019 8:52:09,Ian Langer,Man,27.0,NC,Intern with salary
2,2/4/2019 11:35:41,Jenny,Woman,28.0,NM,Employed with salary
3,2/5/2019 19:29:26,Bradocius the Wise,An ideal,111.0,WA,Intern with salary
4,2/9/2019 19:33:48,Jose Luis Gomez-Angulo,Man,26.0,AZ,Not working
5,2/9/2019 19:34:41,Jose Luis Gomez-Angulo,Man,26.0,AZ,Not working
6,2/9/2019 19:35:42,Jose Luis Gomez-Angulo,Man,26.0,AZ,Not working
7,2/11/2019 13:15:18,Ana,Woman,26.0,GE,Not working
8,2/11/2019 13:17:18,Ana,Woman,26.0,Ge,Not working
9,2/11/2019 16:24:11,Todd Albertson,Man,25.0,CA,Not working


The information collected was retrieved in a comma-separated values format. This is a very common format. Notice the settings of __read_csv__: I tell pandas that the first row has the names of the columns (header is in position 0); and then I rename the headers.
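
The effect of `header=0` together with `names` can be checked offline. The sketch below uses a made-up CSV text (the questions and answers are assumptions standing in for the published form) instead of the real link:

```python
import io
import pandas as pd

# hypothetical CSV text standing in for the published form responses
rawCsv = io.StringIO(
    "Timestamp,What is your name?,How old are you?\n"
    "1/30/2019 9:35:58,Nat,29\n"
    "2/4/2019 8:52:09,Jenny,28\n"
)

# header=0: the first line holds the original headers (so it is not data);
# names=...: replace those headers with shorter ones
shortNames = ['timeStamp', 'name', 'age']
formData = pd.read_csv(rawCsv, header=0, names=shortNames)

print(formData.columns.tolist())
print(formData.shape)
```

Without `header=0`, pandas would treat the line with the original questions as one more row of data.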

* Collecting data from APIs

Data served by APIs usually come in more complex formats, like XML or JSON. Let's get the data about 
[Seattle Real Time Fire 911 Calls](https://dev.socrata.com/foundry/data.seattle.gov/grwu-wqtk). Let me follow the instructions from that website to get the data:

In [10]:
!pip install sodapy # pip install sodapy

Collecting sodapy
  Downloading https://files.pythonhosted.org/packages/11/02/5baf6e10a47018babcc43e3ed03a6f13712187f4ab1fbe263b479c77d117/sodapy-1.5.2-py2.py3-none-any.whl
Installing collected packages: sodapy
Successfully installed sodapy-1.5.2


In [11]:
from sodapy import Socrata

client = Socrata("data.seattle.gov", None)
results = client.get("grwu-wqtk", limit=1000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)
results_df.head()



Unnamed: 0,:@computed_region_2day_rhn5,:@computed_region_cyqu_gs94,:@computed_region_kuhn_3gp2,:@computed_region_q256_3sug,address,datetime,incident_number,latitude,longitude,report_location,type
0,,,29,19579,11528 28th Av Ne,2011-03-09T01:11:00.000,F110020765,47.7123,-122.297578,"{'type': 'Point', 'coordinates': [-122.297578,...",Medic Response
1,,,41,17919,3732 14th Av S,2017-12-10T03:10:00.000,F170128702,47.570516,-122.314759,"{'type': 'Point', 'coordinates': [-122.314759,...",Aid Response
2,,,44,18388,Beacon Av S / S Eddy St,2017-05-03T00:11:00.000,F170042740,47.545329,-122.30029,"{'type': 'Point', 'coordinates': [-122.30029, ...",MVI - Motor Vehicle Incident
3,,,22,18379,4th Av S / S Washington St,2017-04-11T06:41:00.000,F170035060,47.600875,-122.328963,"{'type': 'Point', 'coordinates': [-122.328963,...",Aid Response
4,,,34,18377,8830 Nesbit Av N,2012-02-07T18:31:00.000,F120011815,47.692638,-122.343471,"{'type': 'Point', 'coordinates': [-122.343471,...",Medic Response- 6 per Rule
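
The conversion step can be tried without calling the API. This sketch assumes the response is a list of dictionaries, as the use of `from_records` suggests; the records in `fakeResults` are made up for illustration:

```python
import pandas as pd

# hypothetical records imitating the structure of an API response
fakeResults = [
    {'address': '11528 28th Av Ne', 'type': 'Medic Response', 'latitude': '47.7123'},
    {'address': '3732 14th Av S', 'type': 'Aid Response'},   # a key is missing here
]

fake_df = pd.DataFrame.from_records(fakeResults)

# keys become columns; a missing key becomes a missing value
print(fake_df.columns.tolist())
print(fake_df['latitude'].isna().tolist())
```

This is why some columns in the table above show empty cells: not every record brings every key.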


[Go to page beginning](#beginning)

_____
<a id='part2'></a>

## Cleaning

The files opened above look clean because they may have been produced professionally for statistical work, or because the collecting tool was very restrictive. However, much of the data you collect will not be as clean as what we got in the previous code. This common webpage has a table that may be needed:

In [12]:
wikiLink="https://en.wikipedia.org/wiki/List_of_freedom_indices"


import IPython
iframe = '<iframe src=' + wikiLink + ' width=700 height=350></iframe>'
IPython.display.HTML(iframe)



Let's try to get the table using pandas:

In [13]:
import pandas as pd

#attrs - what html attributes i am looking for (class & wikitable sortable are html definitions)

wikiTables=pd.read_html(wikiLink,header=0,attrs={'class': 'wikitable sortable'})

I tried to get all the tables. I may have more than one:

In [14]:
# What do I have? / How many?
type(wikiTables), len(wikiTables)

#pandas gives a collection of tables; there can be more than one, and here it is a list with one element

(list, 1)

I need to recover the first table from the list (the only one).

In [15]:
DF=wikiTables[0]

#what is it?
type(DF)

pandas.core.frame.DataFrame

Great!...we have a data frame; then:

In [16]:
DF.head()

Unnamed: 0,Country,Freedom in the World 2018[10],2018 Index of Economic Freedom[11],2018 Press Freedom Index[3],2017 Democracy Index[13]
0,Abkhazia,,,,
1,Afghanistan,,,,
2,Albania,,,,
3,Algeria,,,,
4,Andorra,,,,


This data frame does not look like the one we see on the website. We need to improve the call:

In [17]:
# install 'beautifulsoup4'
DF=pd.read_html(wikiLink,header=0,flavor='bs4',attrs={'class': 'wikitable sortable'})[0]
DF.head()

#the flavor parameter tells pandas not to use its default parser to scrape the text, but the one from the "beautifulsoup4" package. To install the package manually, use !pip install bs4

#NaN is a good way of showing missing value

Unnamed: 0,Country,Freedom in the World 2018[10],2018 Index of Economic Freedom[11],2018 Press Freedom Index[3],2017 Democracy Index[13]
0,Abkhazia,partly free,,,
1,Afghanistan,not free,mostly unfree,difficult situation,authoritarian regime
2,Albania,partly free,moderately free,noticeable problems,hybrid regime
3,Algeria,not free,repressed,difficult situation,authoritarian regime
4,Andorra,free,,satisfactory situation,


Combining BeautifulSoup (BS) and Pandas gave us the right result. But our work is not over.

Pay attention to the cleaning pandas+BS have done: the 'n/a' was interpreted as **NaN**; no country flags in the data; and the headers are in the right place. 

However, to prepare a final data set, we should pay attention to the header names to avoid _blanks_, and erase the _footnote_ marks.

We can have two strategies:
* Brute-force!

In [18]:
# with only a few names to change, we can use a brute-force strategy (writing out the list of column names):
DF.columns=['Country',
 'FreedomintheWorld',
 'IndexofEconomicFreedom',
 'PressFreedomIndex',
 'DemocracyIndex']
DF.head()

Unnamed: 0,Country,FreedomintheWorld,IndexofEconomicFreedom,PressFreedomIndex,DemocracyIndex
0,Abkhazia,partly free,,,
1,Afghanistan,not free,mostly unfree,difficult situation,authoritarian regime
2,Albania,partly free,moderately free,noticeable problems,hybrid regime
3,Algeria,not free,repressed,difficult situation,authoritarian regime
4,Andorra,free,,satisfactory situation,


* Using more computational thinking (algorithmic):

In [19]:
# if we had many columns, writing an algorithm to rename the columns could be better:

# recalling the data:
DF=pd.read_html(wikiLink,header=0,flavor='bs4',attrs={'class': 'wikitable sortable'})[0]

I just recalled the data to do several steps:

1. Find blanks.
2. Find numbers.
3. Find brackets (opening and closing).

The previous requires a **regular expression**:

In [21]:
import re  # part of the Python standard library, no installation needed

# find blanks: \\s+
# find numbers: \\d+ 
# find opening bracket : \\[
# find closing bracket: \\]

# You can combine using '|' (or):
pattern='\\s+|\\d+|\\[|\\]'
nothing=''

Now, let's see how this works for one case:

In [22]:
testString='Freedom in the World 2018[10]'
re.sub(pattern,nothing,testString) #re.sub - is a function of re package to substitute

'FreedomintheWorld'
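
As a side note, the same pattern can be written as a raw string, which avoids doubling the backslashes. This is a minimal sketch; `rawPattern` and `cleaned` are names introduced here only for illustration:

```python
import re

# the pattern as written in the text (escaped backslashes)...
pattern = '\\s+|\\d+|\\[|\\]'
# ...and the same pattern as a raw string, which is usually easier to read
rawPattern = r'\s+|\d+|\[|\]'

testString = '2018 Index of Economic Freedom[11]'
cleaned = re.sub(rawPattern, '', testString)

print(pattern == rawPattern)
print(cleaned)
```

Keep in mind that this pattern erases every digit, so it is only safe when no column name needs a number in it.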

Now, let's see how this works for ALL cases:

In [23]:
DF.columns

Index(['Country', 'Freedom in the World 2018[10]',
       '2018 Index of Economic Freedom[11]', '2018 Press Freedom Index[3]',
       '2017 Democracy Index[13]'],
      dtype='object')

In [24]:
#using comprehension

[re.sub(pattern,nothing,name) for name in DF.columns]

['Country',
 'FreedomintheWorld',
 'IndexofEconomicFreedom',
 'PressFreedomIndex',
 'DemocracyIndex']

We can verify we are matching well:

In [25]:
newNames=[re.sub(pattern,nothing,name) for name in DF.columns]

# checking:
list(zip(DF.columns,newNames))

[('Country', 'Country'),
 ('Freedom in the World 2018[10]', 'FreedomintheWorld'),
 ('2018 Index of Economic Freedom[11]', 'IndexofEconomicFreedom'),
 ('2018 Press Freedom Index[3]', 'PressFreedomIndex'),
 ('2017 Democracy Index[13]', 'DemocracyIndex')]

Let's turn that match into a dictionary:

In [26]:
{old:new for old,new in zip(DF.columns,newNames)}

{'Country': 'Country',
 'Freedom in the World 2018[10]': 'FreedomintheWorld',
 '2018 Index of Economic Freedom[11]': 'IndexofEconomicFreedom',
 '2018 Press Freedom Index[3]': 'PressFreedomIndex',
 '2017 Democracy Index[13]': 'DemocracyIndex'}

Once you have a dict like that one, you can use it to rename the columns with another function:

In [27]:
changes={old:new for old,new in zip(DF.columns,newNames)}
DF.rename(columns=changes,inplace=True) #inplace - do this thing and change it immediately in the dataframe

If you have a set of new names but do not want to change every column name, renaming with a dict is the way to do it.
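
A small sketch of that difference, on a made-up data frame (`scores` and its column names are assumptions for illustration):

```python
import pandas as pd

# hypothetical data frame with three columns
scores = pd.DataFrame({'Country': ['A'], 'Index 2018[1]': [1], 'Other': [2]})

# rename() only touches the keys found in the dict; the rest stay as they are
renamed = scores.rename(columns={'Index 2018[1]': 'Index'})
print(renamed.columns.tolist())

# assigning to .columns, in contrast, needs one name per column
try:
    scores.columns = ['Country', 'Index']   # only two names for three columns
except ValueError as err:
    print('ValueError:', err)
```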

Let's see the result:

In [28]:
DF.head()

Unnamed: 0,Country,FreedomintheWorld,IndexofEconomicFreedom,PressFreedomIndex,DemocracyIndex
0,Abkhazia,partly free,,,
1,Afghanistan,not free,mostly unfree,difficult situation,authoritarian regime
2,Albania,partly free,moderately free,noticeable problems,hybrid regime
3,Algeria,not free,repressed,difficult situation,authoritarian regime
4,Andorra,free,,satisfactory situation,


A next step will be verifying if the answers are well coded:

In [29]:
DF.iloc[:,1::].describe() #no need to check Country column as they are unique

Unnamed: 0,FreedomintheWorld,IndexofEconomicFreedom,PressFreedomIndex,DemocracyIndex
count,206,180,189,167
unique,3,5,5,4
top,free,mostly unfree,noticeable problems,flawed democracy
freq,89,63,63,57


What were you looking for? 
Sometimes a category may be wrongly written in a cell, for instance, if you had 'Free' and 'free' or 'free ' to represent the same in one column, you have a mistake. Let's see if there is one here:

In [30]:
DF.FreedomintheWorld.value_counts()

free           89
partly free    62
not free       55
Name: FreedomintheWorld, dtype: int64

What we see is that this variable has a correct set of answers.

We can try that approach for each variable, but we can also check the whole group of categorical values like this:

In [31]:
# DF.iloc[:,1::] all columns but the first one
# apply(set)  apply the function 'set()'  per column # give unique values
# tolist() convert to a list 

DF.iloc[:,1::].apply(set).tolist()

[{'free', nan, 'not free', 'partly free'},
 {'free', 'moderately free', 'mostly free', 'mostly unfree', nan, 'repressed'},
 {'difficult situation',
  'good situation',
  nan,
  'noticeable problems',
  'satisfactory situation',
  'very serious situation'},
 {'authoritarian regime',
  'flawed democracy',
  'full democracy',
  'hybrid regime',
  nan}]
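
The Wikipedia table had no badly written categories, but if it had, a common repair is to normalize blanks and case. This is a sketch on made-up values (`messy` is hypothetical), not something the data above required:

```python
import pandas as pd

# hypothetical column where the same answer was typed in three ways
messy = pd.Series(['Free', 'free', 'free ', 'not free', None])

print(messy.value_counts().to_dict())      # three variants of 'free'

# strip() removes surrounding blanks, lower() unifies the case
tidy = messy.str.strip().str.lower()
print(tidy.value_counts().to_dict())
```

The missing value stays missing: the string methods skip it instead of failing.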

[Go to page beginning](#beginning)
____
<a id='part3'></a>
## Formatting

The data seem _clean_, but now we need to be sure the information is in the right format. This varies according to the project; so, let me show you some steps of the formatting stage.

1. Verify the data types:


In [32]:
DF.dtypes

Country                   object
FreedomintheWorld         object
IndexofEconomicFreedom    object
PressFreedomIndex         object
DemocracyIndex            object
dtype: object

All but the first variable are categories, not text (_object_). To convert them into categories you can do this:

In [33]:
headerNames=DF.columns
DF[headerNames[1:]]=DF[headerNames[1:]].astype('category')

When a variable is of categorical type, you can use particular functions for them:

In [34]:
DF.FreedomintheWorld.cat.categories #missing values are not shown: NaN is never one of the categories

Index(['free', 'not free', 'partly free'], dtype='object')

In [35]:
DF.IndexofEconomicFreedom.cat.categories

Index(['free', 'moderately free', 'mostly free', 'mostly unfree', 'repressed'], dtype='object')

In [37]:
DF.PressFreedomIndex.cat.categories

Index(['difficult situation', 'good situation', 'noticeable problems',
       'satisfactory situation', 'very serious situation'],
      dtype='object')

In [36]:
DF.DemocracyIndex.cat.categories

Index(['authoritarian regime', 'flawed democracy', 'full democracy',
       'hybrid regime'],
      dtype='object')

2. If ordinal, make the adjustment.

What differentiates a plain categorical from an ordinal categorical is that the categories of the latter have a meaningful order. These variables should be ordinal, but the current (alphabetical) order of their categories does not reflect the order they should have.

We can turn it into an ordinal doing the following:

a. Find a good numeric sequence for the ordinal values:

In [38]:
# notice I am using the numbers in the same order as the list of categorical values:
oldFree=list(DF.FreedomintheWorld.cat.categories)

# '5 very good' / '4 good' / '3 middle' / '2 bad' / '1 very bad'

#mapping

newFree=[5,1,3]
recodeFree={old:new for old,new in zip (oldFree,newFree)}

oldEco=list(DF.IndexofEconomicFreedom.cat.categories)
newEco=[5,3,4,2,1]
recodeEco={old:new for old,new in zip (oldEco,newEco)}

oldPress=list(DF.PressFreedomIndex.cat.categories)
newPress=[2,5,3,4,1]
recodePress={old:new for old,new in zip (oldPress,newPress)}

oldDemo=list(DF.DemocracyIndex.cat.categories)
newDemo=[1,4,5,2]
recodeDemo={old:new for old,new in zip (oldDemo,newDemo)}

b. Rename the still plain categorical:

In [39]:
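# assign the result back (the 'inplace' argument is not available in recent pandas versions)
DF['FreedomintheWorld']=DF.FreedomintheWorld.cat.rename_categories(recodeFree)

DF['IndexofEconomicFreedom']=DF.IndexofEconomicFreedom.cat.rename_categories(recodeEco)

DF['PressFreedomIndex']=DF.PressFreedomIndex.cat.rename_categories(recodePress)

DF['DemocracyIndex']=DF.DemocracyIndex.cat.rename_categories(recodeDemo)

# let's see: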
DF.head(10)

Unnamed: 0,Country,FreedomintheWorld,IndexofEconomicFreedom,PressFreedomIndex,DemocracyIndex
0,Abkhazia,3,,,
1,Afghanistan,1,2.0,2.0,1.0
2,Albania,3,3.0,3.0,2.0
3,Algeria,1,1.0,2.0,1.0
4,Andorra,5,,4.0,
5,Angola,1,1.0,2.0,1.0
6,Antigua and Barbuda,5,,4.0,
7,Argentina,5,2.0,3.0,4.0
8,Armenia,3,3.0,3.0,2.0
9,Australia,5,5.0,4.0,5.0


In [41]:
DF.dtypes

Country                     object
FreedomintheWorld         category
IndexofEconomicFreedom    category
PressFreedomIndex         category
DemocracyIndex            category
dtype: object

c. Now turn the renamed columns into numeric values:

In [42]:
DF[headerNames[1:]]=DF[headerNames[1:]].apply(pd.to_numeric)

Let me verify:

In [43]:
DF.head()

Unnamed: 0,Country,FreedomintheWorld,IndexofEconomicFreedom,PressFreedomIndex,DemocracyIndex
0,Abkhazia,3.0,,,
1,Afghanistan,1.0,2.0,2.0,1.0
2,Albania,3.0,3.0,3.0,2.0
3,Algeria,1.0,1.0,2.0,1.0
4,Andorra,5.0,,4.0,
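
An alternative to recoding into numbers is to keep the labels and declare their order. The sketch below is not the route followed in this session; it assumes made-up values in `fiw` and an explicit list `levels` written from worst to best:

```python
import pandas as pd

# hypothetical values; the labels are the ones from 'Freedom in the World'
fiw = pd.Series(['partly free', 'not free', 'free', None])

# ordered=True plus an explicit list, from worst to best
levels = ['not free', 'partly free', 'free']
fiwOrdinal = pd.Categorical(fiw, categories=levels, ordered=True)

print(fiwOrdinal.categories.tolist())
print(fiwOrdinal.codes.tolist())   # position in 'levels'; -1 marks a missing value
```

The numeric recoding used above is simpler to export to CSV, which is one reason to prefer it when the data will be sent to another program.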


3. Deal with missing data

The data has some missing data:

In [44]:
DF.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207 entries, 0 to 206
Data columns (total 5 columns):
Country                   207 non-null object
FreedomintheWorld         206 non-null float64
IndexofEconomicFreedom    180 non-null float64
PressFreedomIndex         189 non-null float64
DemocracyIndex            167 non-null float64
dtypes: float64(4), object(1)
memory usage: 8.2+ KB


Now comes the thinking: How to replace the missing values?

Python can easily find and replace every missing value; but our strategy will be different:

* _Freedom in the World_ has the fewest missing values, so we will use this variable to see how the others behave.

* Since the variables are ordinal (even though they are numbers now), a good candidate to impute a missing value is the median, NOT the mean (the mean is not meaningful for an ordinal variable).

Let's see:

In [45]:
#median per group: 
DF.groupby('FreedomintheWorld')[headerNames[2:]].median()

FreedomintheWorld,IndexofEconomicFreedom,PressFreedomIndex,DemocracyIndex
1.0,2.0,2.0,1.0
3.0,2.0,3.0,2.0
5.0,3.0,4.0,4.0


We need to use those medians to replace the missing values wherever one is found:

In [46]:
for col in headerNames[2:]:
    # in each column, get the median by FIW group, and use it to replace the missing values;
    # assigning the result back avoids relying on an in-place change of a selected column
    DF[col]=DF[col].fillna(DF.groupby("FreedomintheWorld")[col].transform("median"))

In [47]:
DF.head(20)

Unnamed: 0,Country,FreedomintheWorld,IndexofEconomicFreedom,PressFreedomIndex,DemocracyIndex
0,Abkhazia,3.0,2.0,3.0,2.0
1,Afghanistan,1.0,2.0,2.0,1.0
2,Albania,3.0,3.0,3.0,2.0
3,Algeria,1.0,1.0,2.0,1.0
4,Andorra,5.0,3.0,4.0,4.0
5,Angola,1.0,1.0,2.0,1.0
6,Antigua and Barbuda,5.0,3.0,4.0,4.0
7,Argentina,5.0,2.0,3.0,4.0
8,Armenia,3.0,3.0,3.0,2.0
9,Australia,5.0,5.0,4.0,5.0
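
To see what the imputation does step by step, here is a minimal sketch on made-up numbers, where the hypothetical column `group` plays the role of _FreedomintheWorld_:

```python
import pandas as pd

# hypothetical scores: 'group' plays the role of FreedomintheWorld
toy = pd.DataFrame({'group': [1, 1, 1, 5, 5, 5],
                    'score': [2.0, None, 4.0, 3.0, 5.0, None]})

# transform() returns one value per row: the median of that row's group
groupMedian = toy.groupby('group')['score'].transform('median')
print(groupMedian.tolist())

# fillna() only uses those medians where 'score' is missing
toy['score'] = toy['score'].fillna(groupMedian)
print(toy['score'].tolist())
```

The observed values are left untouched; only the two missing cells receive the median of their own group.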


We can send this to R, in a simple CSV format:

In [48]:
DF.to_csv("indexes.csv",index=None)

______

## More examples

### Case: Democracy Index

Let me clean a similar data set from Wikipedia, about the Democracy Index:

In [70]:
import pandas as pd

#location:
demoLink = "https://en.wikipedia.org/wiki/Democracy_Index" 

#collection
demodex=pd.read_html(demoLink,header=0,flavor='bs4',attrs={'class': 'wikitable sortable'})[0]

1. Looking for messiness:

In [71]:
# what's on top?
# names? weird symbols? more links?
demodex.head(10)

Unnamed: 0,Rank,Country,Score,Electoral processand pluralism,Functioning ofgovernment,Politicalparticipation,Politicalculture,Civilliberties,Category
0,1,Norway,9.87,10.0,9.64,10.0,10.0,9.71,Full democracy
1,2,Iceland,9.58,10.0,9.29,8.89,10.0,9.71,Full democracy
2,3,Sweden,9.39,9.58,9.64,8.33,10.0,9.41,Full democracy
3,4,New Zealand,9.26,10.0,9.29,8.89,8.13,10.0,Full democracy
4,5,Denmark,9.22,10.0,9.29,8.33,9.38,9.12,Full democracy
5,=6,Ireland,9.15,9.58,7.86,8.33,10.0,10.0,Full democracy
6,=6,Canada,9.15,9.58,9.64,7.78,8.75,10.0,Full democracy
7,8,Finland,9.14,10.0,8.93,8.33,8.75,9.71,Full democracy
8,9,Australia,9.09,10.0,8.93,7.78,8.75,10.0,Full democracy
9,10,Switzerland,9.03,9.58,9.29,7.78,9.38,9.12,Full democracy


In [73]:
# what's at the bottom?
# note? credits? extra info?

demodex.tail(10)

Unnamed: 0,Rank,Country,Score,Electoral processand pluralism,Functioning ofgovernment,Politicalparticipation,Politicalculture,Civilliberties,Category
158,=159,Saudi Arabia,1.93,0.00,2.86,2.22,3.13,1.47,Authoritarian
159,=159,Tajikistan,1.93,0.08,0.79,1.67,6.25,0.88,Authoritarian
160,161,Equatorial Guinea,1.92,0.00,0.43,3.33,4.38,1.47,Authoritarian
161,162,Turkmenistan,1.72,0.00,0.79,2.22,5.00,0.59,Authoritarian
162,163,Chad,1.61,0.00,0.00,1.67,3.75,2.65,Authoritarian
163,164,Central African Republic,1.52,2.25,0.00,1.11,1.88,2.35,Authoritarian
164,165,Democratic Republic of the Congo,1.49,0.50,0.71,2.22,3.13,0.88,Authoritarian
165,166,Syria,1.43,0.00,0.00,2.78,4.38,0.00,Authoritarian
166,167,North Korea,1.08,0.00,2.50,1.67,1.25,0.00,Authoritarian
167,Rank,Country,Score,Electoral processand pluralism,Functioning ofgovernment,Politicalparticipation,Politicalculture,Civilliberties,Category


First, we see a column with some messiness (the symbol "=" in _Rank_), but it can be deleted as its information is not relevant. Let me also get rid of _Score_, as it is just the mean of the other ones. The last row is a repetition of the headers, so that one should go, too:

In [75]:
#bye row 167, and two columns
demodexClean=demodex.drop(index=167,columns=['Rank','Score'])

In [76]:
demodexClean

Unnamed: 0,Country,Electoral processand pluralism,Functioning ofgovernment,Politicalparticipation,Politicalculture,Civilliberties,Category
0,Norway,10.00,9.64,10.00,10.00,9.71,Full democracy
1,Iceland,10.00,9.29,8.89,10.00,9.71,Full democracy
2,Sweden,9.58,9.64,8.33,10.00,9.41,Full democracy
3,New Zealand,10.00,9.29,8.89,8.13,10.00,Full democracy
4,Denmark,10.00,9.29,8.33,9.38,9.12,Full democracy
5,Ireland,9.58,7.86,8.33,10.00,10.00,Full democracy
6,Canada,9.58,9.64,7.78,8.75,10.00,Full democracy
7,Finland,10.00,8.93,8.33,8.75,9.71,Full democracy
8,Australia,10.00,8.93,7.78,8.75,10.00,Full democracy
9,Switzerland,9.58,9.29,7.78,9.38,9.12,Full democracy
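
Dropping by position works when you know where the repeated header is. If you do not, a filter can find it. This is a sketch on a made-up table (`scraped` is hypothetical), shown as an alternative to `drop(index=167)`:

```python
import pandas as pd

# hypothetical scraped table whose last row repeats the headers
scraped = pd.DataFrame({'Rank': ['1', '=2', '=2', 'Rank'],
                        'Country': ['A', 'B', 'C', 'Country'],
                        'Score': ['9.8', '9.1', '9.1', 'Score']})

# keep the rows where 'Country' is not the literal header text,
# then drop the columns that are not needed
isHeader = scraped['Country'] == 'Country'
cleanToy = scraped[~isHeader].drop(columns=['Rank', 'Score'])

print(cleanToy.shape)
print(cleanToy['Country'].tolist())
```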


As there are few names, we can replace them with shorter ones:

In [77]:
newNames=['plularism','effectiveness','participation','culture','liberties']

# names from the second and before the last one '[1:-1]':
newMapper={old:new for old,new in zip(demodexClean.columns[1:-1],newNames)}

demodexClean.rename(columns=newMapper,inplace=True)

In [78]:
# this is what we have so far:
demodexClean.head()

Unnamed: 0,Country,plularism,effectiveness,participation,culture,liberties,Category
0,Norway,10.0,9.64,10.0,10.0,9.71,Full democracy
1,Iceland,10.0,9.29,8.89,10.0,9.71,Full democracy
2,Sweden,9.58,9.64,8.33,10.0,9.41,Full democracy
3,New Zealand,10.0,9.29,8.89,8.13,10.0,Full democracy
4,Denmark,10.0,9.29,8.33,9.38,9.12,Full democracy


It looks good so far. Let's go to formatting.

2. Giving the right format:

In [79]:
# checking data types:
demodexClean.dtypes

Country          object
plularism        object
effectiveness    object
participation    object
culture          object
liberties        object
Category         object
dtype: object

Above, we realized the need to make some indices into numeric:

In [80]:
demodexClean[newNames]=demodexClean[newNames].apply(pd.to_numeric)
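
The conversion above works because every cell holds text that looks like a number. When that is not guaranteed, `pd.to_numeric` has an `errors` argument. A minimal sketch on made-up values (`asText` is hypothetical):

```python
import pandas as pd

# hypothetical text column with one value that is not a number
asText = pd.Series(['9.87', '10.0', 'n.d.'])

# errors='coerce' turns what cannot be parsed into a missing value
asNumber = pd.to_numeric(asText, errors='coerce')

print(asNumber.dtype)
print(asNumber.isna().tolist())
```

With the default setting the same call would stop with an error, which is often useful as a warning that the column still needs cleaning.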

The last one is a categorical variable:

In [81]:
demodexClean.Category.value_counts()

Flawed democracy       54
Authoritarian          53
Hybrid regime          39
Full democracy         20
Flawed democracy[a]     1
Name: Category, dtype: int64

When you have text, you could get the unique values of a column like this:

In [82]:
pd.unique(demodexClean.Category).tolist()

['Full democracy',
 'Flawed democracy[a]',
 'Flawed democracy',
 'Hybrid regime',
 'Authoritarian']

Then, you can prepare the map to recode the values:

In [83]:
oldValues=pd.unique(demodexClean.Category).tolist()
newValues=[4,3,3,2,1]
mapNewOld={old:new for old,new in zip(oldValues,newValues)}
mapNewOld

{'Full democracy': 4,
 'Flawed democracy[a]': 3,
 'Flawed democracy': 3,
 'Hybrid regime': 2,
 'Authoritarian': 1}

You can do it in this way:

In [84]:
demodexClean.Category.replace(mapNewOld,inplace=True)

In [None]:
# or this one:
# demodexClean.Category=demodexClean.Category.replace(mapNewOld)
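
Another option is to erase the footnote mark first, so the map needs one entry per real category. This is a sketch on made-up values (the names `category`, `noFootnote` and `levels` are introduced here for illustration):

```python
import pandas as pd

# hypothetical values, including a footnote mark like the one seen above
category = pd.Series(['Full democracy', 'Flawed democracy[a]',
                      'Flawed democracy', 'Authoritarian'])

# remove anything of the form '[...]' at the end of the text
noFootnote = category.str.replace(r'\[.*\]$', '', regex=True)

# map() recodes through a dict; labels missing from the dict would become NaN
levels = {'Full democracy': 4, 'Flawed democracy': 3,
          'Hybrid regime': 2, 'Authoritarian': 1}
recoded = noFootnote.map(levels)

print(noFootnote.unique().tolist())
print(recoded.tolist())
```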

You can save it as a category, but that type will be lost when the data is sent to R as a CSV file:

In [85]:
demodexClean.Category=demodexClean.Category.astype('category')

In [86]:
demodexClean['Category'].cat.categories

Int64Index([1, 2, 3, 4], dtype='int64')

In [87]:
# checking missing values

demodexClean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 167 entries, 0 to 166
Data columns (total 7 columns):
Country          167 non-null object
plularism        167 non-null float64
effectiveness    167 non-null float64
participation    167 non-null float64
culture          167 non-null float64
liberties        167 non-null float64
Category         167 non-null category
dtypes: category(1), float64(5), object(1)
memory usage: 9.5+ KB


This data is now ready for R.

In [88]:
demodexClean.to_csv("democracyIndex.csv",index=None)

### The case of Medicare:

Here I will use data from [Medicare Beneficiary Enrollment and Demographics](https://dev.socrata.com/foundry/data.wa.gov/2cup-2fnu)

In [113]:
import requests

# This time I am talking to the API from DATA.WA.GOV
url = "https://data.wa.gov/resource/2cup-2fnu.json?year=2014"
response = requests.get(url)
if response.status_code == 200:
    medicare = response.json()

In [114]:
# turning json into DF:
medicare2014 = pd.DataFrame(medicare)

In [115]:
medicare2014.head()

Unnamed: 0,average_age,average_hcc_score,beneficiaries_with_part_a_and_part_b,county,ffs_beneficiaries,ma_beneficiaries,ma_participation_rate,percent_african_american,percent_eligible_for_medicaid,percent_female,percent_hispanic,percent_male,percent_non_hispanic_white,percent_other_unknown,state_and_county_fips_code,to_sort_by_county_and_year,to_sort_by_year_and_county,year
0,71.0,0.9,1098715,STATE TOTAL,739717,358998,32.7,2.6,19.1,53.2,3.4,46.9,86.3,7.7,.,0.2014,2014,2014
1,73.0,0.86,1557,ADAMS,1333,224,14.4,,15.6,51.2,,48.8,,,53001,530012014.0,201453001,2014
2,71.0,0.93,5426,ASOTIN,4515,911,16.8,,18.5,51.8,,48.2,,,53003,530032014.0,201453003,2014
3,71.0,0.92,28303,BENTON,24054,4249,15.0,0.8,15.5,53.8,5.1,46.2,90.0,4.1,53005,530052014.0,201453005,2014
4,72.0,0.86,11040,CHELAN,8884,2156,19.5,0.2,20.2,51.3,5.4,48.7,91.7,2.7,53007,530072014.0,201453007,2014


In [92]:
medicare2014.tail()

Unnamed: 0,average_age,average_hcc_score,beneficiaries_with_part_a_and_part_b,county,ffs_beneficiaries,ma_beneficiaries,ma_participation_rate,percent_african_american,percent_eligible_for_medicaid,percent_female,percent_hispanic,percent_male,percent_non_hispanic_white,percent_other_unknown,state_and_county_fips_code,to_sort_by_county_and_year,to_sort_by_year_and_county,year
35,72.0,0.83,1249,WAHKIAKUM,935,314,25.1,,13.4,48.0,,52.0,,,53069,530692014,201453069,2014
36,73.0,0.95,11149,WALLA WALLA,9724,1425,12.8,0.6,17.6,55.2,5.7,44.9,91.1,2.7,53071,530712014,201453071,2014
37,71.0,0.89,36244,WHATCOM,22704,13540,37.4,0.9,21.0,52.3,2.7,47.7,89.3,7.1,53073,530732014,201453073,2014
38,73.0,0.86,4961,WHITMAN,4770,191,3.9,0.5,13.4,53.9,0.9,46.1,95.0,3.7,53075,530752014,201453075,2014
39,71.0,0.94,37337,YAKIMA,29424,7913,21.2,1.0,25.8,53.5,15.5,46.5,78.2,5.4,53077,530772014,201453077,2014


The first row is the total, it has to go:

In [94]:
#this one?
medicare2014.drop(index=0).head()

Unnamed: 0,average_age,average_hcc_score,beneficiaries_with_part_a_and_part_b,county,ffs_beneficiaries,ma_beneficiaries,ma_participation_rate,percent_african_american,percent_eligible_for_medicaid,percent_female,percent_hispanic,percent_male,percent_non_hispanic_white,percent_other_unknown,state_and_county_fips_code,to_sort_by_county_and_year,to_sort_by_year_and_county,year
1,73.0,0.86,1557,ADAMS,1333,224,14.4,,15.6,51.2,,48.8,,,53001,530012014,201453001,2014
2,71.0,0.93,5426,ASOTIN,4515,911,16.8,,18.5,51.8,,48.2,,,53003,530032014,201453003,2014
3,71.0,0.92,28303,BENTON,24054,4249,15.0,0.8,15.5,53.8,5.1,46.2,90.0,4.1,53005,530052014,201453005,2014
4,72.0,0.86,11040,CHELAN,8884,2156,19.5,0.2,20.2,51.3,5.4,48.7,91.7,2.7,53007,530072014,201453007,2014
5,73.0,0.81,22429,CLALLAM,20336,2093,9.3,0.3,12.5,52.6,1.3,47.4,93.3,5.1,53009,530092014,201453009,2014


In [95]:
#or this one?
medicare2014.drop(index=0).reset_index().head()

Unnamed: 0,index,average_age,average_hcc_score,beneficiaries_with_part_a_and_part_b,county,ffs_beneficiaries,ma_beneficiaries,ma_participation_rate,percent_african_american,percent_eligible_for_medicaid,percent_female,percent_hispanic,percent_male,percent_non_hispanic_white,percent_other_unknown,state_and_county_fips_code,to_sort_by_county_and_year,to_sort_by_year_and_county,year
0,1,73.0,0.86,1557,ADAMS,1333,224,14.4,,15.6,51.2,,48.8,,,53001,530012014,201453001,2014
1,2,71.0,0.93,5426,ASOTIN,4515,911,16.8,,18.5,51.8,,48.2,,,53003,530032014,201453003,2014
2,3,71.0,0.92,28303,BENTON,24054,4249,15.0,0.8,15.5,53.8,5.1,46.2,90.0,4.1,53005,530052014,201453005,2014
3,4,72.0,0.86,11040,CHELAN,8884,2156,19.5,0.2,20.2,51.3,5.4,48.7,91.7,2.7,53007,530072014,201453007,2014
4,5,73.0,0.81,22429,CLALLAM,20336,2093,9.3,0.3,12.5,52.6,1.3,47.4,93.3,5.1,53009,530092014,201453009,2014


In [96]:
#or this one?
medicare2014.drop(index=0).reset_index(drop=True).head()

Unnamed: 0,average_age,average_hcc_score,beneficiaries_with_part_a_and_part_b,county,ffs_beneficiaries,ma_beneficiaries,ma_participation_rate,percent_african_american,percent_eligible_for_medicaid,percent_female,percent_hispanic,percent_male,percent_non_hispanic_white,percent_other_unknown,state_and_county_fips_code,to_sort_by_county_and_year,to_sort_by_year_and_county,year
0,73.0,0.86,1557,ADAMS,1333,224,14.4,,15.6,51.2,,48.8,,,53001,530012014,201453001,2014
1,71.0,0.93,5426,ASOTIN,4515,911,16.8,,18.5,51.8,,48.2,,,53003,530032014,201453003,2014
2,71.0,0.92,28303,BENTON,24054,4249,15.0,0.8,15.5,53.8,5.1,46.2,90.0,4.1,53005,530052014,201453005,2014
3,72.0,0.86,11040,CHELAN,8884,2156,19.5,0.2,20.2,51.3,5.4,48.7,91.7,2.7,53007,530072014,201453007,2014
4,73.0,0.81,22429,CLALLAM,20336,2093,9.3,0.3,12.5,52.6,1.3,47.4,93.3,5.1,53009,530092014,201453009,2014


When we use inplace, we should not chain the calls:

In [97]:
medicare2014.drop(index=0,inplace=True)
medicare2014.reset_index(drop=True,inplace=True)
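
The reason is that a call with `inplace=True` returns `None`. A minimal sketch on a made-up table (`counties` is hypothetical):

```python
import pandas as pd

# hypothetical table whose first row is a total
counties = pd.DataFrame({'county': ['STATE TOTAL', 'ADAMS', 'ASOTIN'],
                         'n': [3, 1, 2]})

# without inplace, each call returns a new data frame, so calls can be chained
chained = counties.drop(index=0).reset_index(drop=True)
print(chained['county'].tolist())

# with inplace=True the call returns None, so there is nothing to chain on
returned = counties.drop(index=0, inplace=True)
print(returned is None)
print(counties.index.tolist())
```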

The result so far:

In [98]:
medicare2014.head()

Unnamed: 0,average_age,average_hcc_score,beneficiaries_with_part_a_and_part_b,county,ffs_beneficiaries,ma_beneficiaries,ma_participation_rate,percent_african_american,percent_eligible_for_medicaid,percent_female,percent_hispanic,percent_male,percent_non_hispanic_white,percent_other_unknown,state_and_county_fips_code,to_sort_by_county_and_year,to_sort_by_year_and_county,year
0,73.0,0.86,1557,ADAMS,1333,224,14.4,,15.6,51.2,,48.8,,,53001,530012014,201453001,2014
1,71.0,0.93,5426,ASOTIN,4515,911,16.8,,18.5,51.8,,48.2,,,53003,530032014,201453003,2014
2,71.0,0.92,28303,BENTON,24054,4249,15.0,0.8,15.5,53.8,5.1,46.2,90.0,4.1,53005,530052014,201453005,2014
3,72.0,0.86,11040,CHELAN,8884,2156,19.5,0.2,20.2,51.3,5.4,48.7,91.7,2.7,53007,530072014,201453007,2014
4,73.0,0.81,22429,CLALLAM,20336,2093,9.3,0.3,12.5,52.6,1.3,47.4,93.3,5.1,53009,530092014,201453009,2014


In [99]:
# what we have
medicare2014.dtypes

average_age                             object
average_hcc_score                       object
beneficiaries_with_part_a_and_part_b    object
county                                  object
ffs_beneficiaries                       object
ma_beneficiaries                        object
ma_participation_rate                   object
percent_african_american                object
percent_eligible_for_medicaid           object
percent_female                          object
percent_hispanic                        object
percent_male                            object
percent_non_hispanic_white              object
percent_other_unknown                   object
state_and_county_fips_code              object
to_sort_by_county_and_year              object
to_sort_by_year_and_county              object
year                                    object
dtype: object

Notice that _county_ and the three variables before the last one should be kept as objects, while the others should be numeric:

In [100]:
# get original order:
original=medicare2014.columns.tolist()
original

['average_age',
 'average_hcc_score',
 'beneficiaries_with_part_a_and_part_b',
 'county',
 'ffs_beneficiaries',
 'ma_beneficiaries',
 'ma_participation_rate',
 'percent_african_american',
 'percent_eligible_for_medicaid',
 'percent_female',
 'percent_hispanic',
 'percent_male',
 'percent_non_hispanic_white',
 'percent_other_unknown',
 'state_and_county_fips_code',
 'to_sort_by_county_and_year',
 'to_sort_by_year_and_county',
 'year']

In [101]:
# new order: '*' unpacks a slice into the list (not needed for a single element)
newOrder=[original[3],*original[14:],*original[0:3],*original[4:14]]
newOrder

['county',
 'state_and_county_fips_code',
 'to_sort_by_county_and_year',
 'to_sort_by_year_and_county',
 'year',
 'average_age',
 'average_hcc_score',
 'beneficiaries_with_part_a_and_part_b',
 'ffs_beneficiaries',
 'ma_beneficiaries',
 'ma_participation_rate',
 'percent_african_american',
 'percent_eligible_for_medicaid',
 'percent_female',
 'percent_hispanic',
 'percent_male',
 'percent_non_hispanic_white',
 'percent_other_unknown']
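
If the `*` inside the list is new to you, this small sketch (with a made-up list of letters) shows what it does and what happens without it:

```python
letters = ['a', 'b', 'c', 'd', 'e', 'f']

# '*' unpacks each slice, so the result is one flat list:
flat = [letters[3], *letters[4:], *letters[:3]]
print(flat)      # ['d', 'e', 'f', 'a', 'b', 'c']

# without '*' each slice stays as a list inside the list:
nested = [letters[3], letters[4:], letters[:3]]
print(nested)    # ['d', ['e', 'f'], ['a', 'b', 'c']]
```

Only the flat version can be used to select columns from a data frame.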

In [102]:
# there are different data types, so let me group the columns by type:

medicare2014=medicare2014[newOrder]
medicare2014.head()

Unnamed: 0,county,state_and_county_fips_code,to_sort_by_county_and_year,to_sort_by_year_and_county,year,average_age,average_hcc_score,beneficiaries_with_part_a_and_part_b,ffs_beneficiaries,ma_beneficiaries,ma_participation_rate,percent_african_american,percent_eligible_for_medicaid,percent_female,percent_hispanic,percent_male,percent_non_hispanic_white,percent_other_unknown
0,ADAMS,53001,530012014,201453001,2014,73.0,0.86,1557,1333,224,14.4,,15.6,51.2,,48.8,,
1,ASOTIN,53003,530032014,201453003,2014,71.0,0.93,5426,4515,911,16.8,,18.5,51.8,,48.2,,
2,BENTON,53005,530052014,201453005,2014,71.0,0.92,28303,24054,4249,15.0,0.8,15.5,53.8,5.1,46.2,90.0,4.1
3,CHELAN,53007,530072014,201453007,2014,72.0,0.86,11040,8884,2156,19.5,0.2,20.2,51.3,5.4,48.7,91.7,2.7
4,CLALLAM,53009,530092014,201453009,2014,73.0,0.81,22429,20336,2093,9.3,0.3,12.5,52.6,1.3,47.4,93.3,5.1


2. Formatting

We can give the right format now:

In [103]:
headerNames=medicare2014.columns
medicare2014[headerNames[4::]]=medicare2014[headerNames[4::]].apply(pd.to_numeric)

In [104]:
#check data types:
medicare2014.dtypes

county                                   object
state_and_county_fips_code               object
to_sort_by_county_and_year               object
to_sort_by_year_and_county               object
year                                      int64
average_age                             float64
average_hcc_score                       float64
beneficiaries_with_part_a_and_part_b      int64
ffs_beneficiaries                         int64
ma_beneficiaries                          int64
ma_participation_rate                   float64
percent_african_american                float64
percent_eligible_for_medicaid           float64
percent_female                          float64
percent_hispanic                        float64
percent_male                            float64
percent_non_hispanic_white              float64
percent_other_unknown                   float64
dtype: object
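
The conversion above works because every value in those columns is either a number or missing. As a sketch of what you could do when a column also has stray symbols, the made-up table below uses the `errors='coerce'` option of `pd.to_numeric`, which turns anything unreadable into a missing value. This option is not used in this notebook, and the `'*'` symbol is an invented example:

```python
import pandas as pd

# hypothetical values: everything arrives as text, one cell has a stray symbol
toy = pd.DataFrame({'county': ['ADAMS', 'ASOTIN', 'BENTON'],
                    'average_age': ['73.0', '71.0', '71.0'],
                    'percent_hispanic': [None, '*', '5.1']})

numericCols = ['average_age', 'percent_hispanic']

# extra keyword arguments given to apply() are passed on to pd.to_numeric
toy[numericCols] = toy[numericCols].apply(pd.to_numeric, errors='coerce')

print(toy.dtypes)
print(toy)
```

Use this with care: coercing hides the symbols, so it is worth counting the missing values before and after.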

We can explore the variables:

In [105]:
medicare2014.describe(include='all') # to include categorical

Unnamed: 0,county,state_and_county_fips_code,to_sort_by_county_and_year,to_sort_by_year_and_county,year,average_age,average_hcc_score,beneficiaries_with_part_a_and_part_b,ffs_beneficiaries,ma_beneficiaries,ma_participation_rate,percent_african_american,percent_eligible_for_medicaid,percent_female,percent_hispanic,percent_male,percent_non_hispanic_white,percent_other_unknown
count,39,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,39.0,30.0,39.0,39.0,30.0,39.0,30.0,30.0
unique,39,39.0,39.0,39.0,,,,,,,,,,,,,,
top,LEWIS,53015.0,530772014.0,201453043.0,,,,,,,,,,,,,,
freq,1,1.0,1.0,1.0,,,,,,,,,,,,,,
mean,,,,,2014.0,71.384615,0.852051,28172.179487,18967.102564,9205.076923,19.910256,1.11,17.258974,51.820513,3.813333,48.187179,89.923333,5.176667
std,,,,,0.0,1.091001,0.074699,47845.550778,29514.437571,18921.743351,13.816348,1.40721,4.126597,1.842507,4.360618,1.842935,5.657932,2.599958
min,,,,,2014.0,69.0,0.67,617.0,600.0,17.0,2.8,0.2,8.4,47.4,0.8,44.9,73.9,2.4
25%,,,,,2014.0,71.0,0.79,4926.5,4529.5,379.5,8.85,0.4,14.8,50.95,1.525,46.75,87.975,3.6
50%,,,,,2014.0,71.0,0.86,11040.0,9335.0,1439.0,16.3,0.55,16.8,51.8,2.55,48.2,91.8,4.3
75%,,,,,2014.0,72.0,0.925,26892.0,21520.0,8036.0,31.0,1.15,20.15,53.25,3.35,49.05,94.075,6.525


In [106]:
medicare2014.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39 entries, 0 to 38
Data columns (total 18 columns):
county                                  39 non-null object
state_and_county_fips_code              39 non-null object
to_sort_by_county_and_year              39 non-null object
to_sort_by_year_and_county              39 non-null object
year                                    39 non-null int64
average_age                             39 non-null float64
average_hcc_score                       39 non-null float64
beneficiaries_with_part_a_and_part_b    39 non-null int64
ffs_beneficiaries                       39 non-null int64
ma_beneficiaries                        39 non-null int64
ma_participation_rate                   39 non-null float64
percent_african_american                30 non-null float64
percent_eligible_for_medicaid           39 non-null float64
percent_female                          39 non-null float64
percent_hispanic                        30 non-null float64
percent_m

There are some missing values, but we will leave them as they are. The last step is just to save the file:

In [107]:
medicare2014.to_csv("medicare2014.csv",index=False)
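
One thing to keep in mind when saving to CSV: the file stores no data types. The sketch below uses a made-up two-row table and an in-memory buffer instead of a real file (both are assumptions for illustration). It shows that a code column kept as text comes back as a number unless you ask for text when reading:

```python
import io
import pandas as pd

# hypothetical mini table: a FIPS code kept as text and a column with a missing value
toy = pd.DataFrame({'fips': ['53001', '53003'],
                    'percent_hispanic': [float('nan'), 5.1]})

buffer = io.StringIO()           # stands in for a file on disk
toy.to_csv(buffer, index=False)  # index=False: do not write the row labels

buffer.seek(0)
plain = pd.read_csv(buffer)                       # pandas guesses the types
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={'fips': str})  # we ask for text

print(plain.dtypes)
print(typed.dtypes)
```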

### Case: Public education

When you visit the [website](https://nces.ed.gov/ccd/) of the Common Core of Data from the US Department of Education, you can get a data set with detailed information on public schools in the state of Washington:

In [108]:
dataFile='https://github.com/EvansDataScience/data/raw/master/wapubs.xlsx'
schoolPub=pd.read_excel(dataFile) 

1. Looking for messiness:

In [109]:
schoolPub.head(20)

Unnamed: 0,National Center for Education Statistics,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,Unnamed: 25
0,CCD public school data 2014-2015 school year,,,,,,,,,,...,,,,,,,,,,
1,The file contains (2398) records based on your...,,,,,,,,,,...,,,,,,,,,,
2,NOTES:,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,[ † ] indicates that the data are not applicable.,,,,,,,,,,...,,,,,,,,,,
5,[ – ] indicates that the data are missing.,,,,,,,,,,...,,,,,,,,,,
6,[ ‡ ] indicates that the data do not meet NCES...,,,,,,,,,,...,,,,,,,,,,
7,,,,,,,,,,,...,,,,,,,,,,
8,"SEARCH CRITERIA: State: ""Washington"" - School ...",,,,,,,,,,...,,,,,,,,,,
9,NCES is not responsible for the manner in whic...,,,,,,,,,,...,,,,,,,,,,


The first row is not the beginning of the table. We need to skip 11 rows, but pay attention to what you are skipping, as it is telling you how missing values were coded.

In [110]:
schoolPub=pd.read_excel(dataFile,skiprows=11,na_values=['†','‡','–'])
schoolPub.head()

Unnamed: 0,NCES School ID,State School ID,NCES District ID,State District ID,Low Grade*,High Grade*,School Name,District,County Name*,Street Address,...,Locale*,Charter,Magnet*,Title I School*,Title 1 School Wide*,Students*,Teachers*,Student Teacher Ratio*,Free Lunch*,Reduced Lunch*
0,530486002475,1656,5304860,31025,06,8,10TH STREET SCHOOL,MARYSVILLE SCHOOL DISTRICT,SNOHOMISH COUNTY,7204 27TH AVE NE,...,Suburb: Midsize,No,,No,,167.0,7.3,22.9,20.0,7.0
1,530270001270,1646,5302700,6114,KG,12,49TH STREET ACADEMY,EVERGREEN SCHOOL DISTRICT (CLARK),CLARK COUNTY,14619B NE 49TH STREET,...,City: Midsize,No,,Yes,Yes,123.0,10.1,12.2,75.0,7.0
2,530910002602,4500,5309100,34033,09,12,A G WEST BLACK HILLS HIGH SCHOOL,TUMWATER SCHOOL DISTRICT,THURSTON COUNTY,7741 LITTLEROCK ROAD SW,...,City: Small,No,,No,,867.0,41.19,21.0,189.0,45.0
3,530003000001,2834,5300030,14005,PK,6,A J WEST ELEMENTARY,ABERDEEN SCHOOL DISTRICT,GRAYS HARBOR COUNTY,1801 BAY AVE.,...,Town: Remote,No,,Yes,Yes,410.0,27.63,14.8,330.0,21.0
4,530825002361,1533,5308250,32081,09,12,A-3 MULTIAGENCY ADOLESCENT PROG,SPOKANE SCHOOL DISTRICT,SPOKANE COUNTY,610 E NORTHFOOTHILLS DRIVE,...,City: Midsize,No,,No,,22.0,3.1,7.1,16.0,3.0
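
The same two options can be tried on a small made-up text table. In this sketch `pd.read_csv` is used instead of `pd.read_excel` only so that it runs without an Excel file; the content of `raw` is invented for illustration:

```python
import io
import pandas as pd

# hypothetical file: four lines of notes, then the real table
raw = """Some agency title
NOTES:
[ † ] indicates that the data are not applicable.
[ – ] indicates that the data are missing.
School Name,Students,Teachers
SCHOOL A,167,7.3
SCHOOL B,†,–
"""

# skip the notes and declare which symbols mean 'missing'
toy = pd.read_csv(io.StringIO(raw), skiprows=4, na_values=['†', '‡', '–'])
print(toy)
print(toy.dtypes)
```

Because the symbols were declared as missing values, both numeric columns are read as numbers instead of text.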


In [111]:
#checking the tail:
schoolPub.tail()

Unnamed: 0,NCES School ID,State School ID,NCES District ID,State District ID,Low Grade*,High Grade*,School Name,District,County Name*,Street Address,...,Locale*,Charter,Magnet*,Title I School*,Title 1 School Wide*,Students*,Teachers*,Student Teacher Ratio*,Free Lunch*,Reduced Lunch*
2393,530813003439,5315,5308130,17406,09,12,YOUTHSOURCE,TUKWILA SCHOOL DISTRICT,KING COUNTY,TUKWILA SCHOOL DISTRICT,...,City: Small,No,,,,0.0,,,0.0,0.0
2394,530696002530,4496,5306960,27003,PK,6,ZEIGER ELEMENTARY,PUYALLUP SCHOOL DISTRICT,PIERCE COUNTY,13008 94TH AVE E,...,Suburb: Large,No,,No,,823.0,44.53,18.5,221.0,58.0
2395,531017001719,2240,5310170,39205,09,12,ZILLAH HIGH SCHOOL,ZILLAH SCHOOL DISTRICT,YAKIMA COUNTY,1602 SECOND AVENUE,...,Town: Distant,No,,Yes,Yes,441.0,21.95,20.1,169.0,49.0
2396,531017001896,4221,5310170,39205,04,6,ZILLAH INTERMEDIATE SCHOOL,ZILLAH SCHOOL DISTRICT,YAKIMA COUNTY,303 SECOND AVENUE,...,Town: Distant,No,,Yes,Yes,319.0,20.38,15.7,145.0,42.0
2397,531017002502,4481,5310170,39205,07,8,ZILLAH MIDDLE SCHOOL,ZILLAH SCHOOL DISTRICT,YAKIMA COUNTY,1301 CUTLER WAY,...,Rural: Fringe,No,,Yes,Yes,196.0,12.65,15.5,91.0,19.0


The headers have blanks and symbols; let's get rid of them here:

In [116]:
import re

pattern='\\*|\\s+'
nothing=''
schoolPub.columns=[re.sub(pattern,nothing,columnName) for columnName in schoolPub.columns]
schoolPub.columns

Index(['NCESSchoolID', 'StateSchoolID', 'NCESDistrictID', 'StateDistrictID',
       'LowGrade', 'HighGrade', 'SchoolName', 'District', 'CountyName',
       'StreetAddress', 'City', 'State', 'ZIP', 'ZIP4-digit', 'Phone',
       'LocaleCode', 'Locale', 'Charter', 'Magnet', 'TitleISchool',
       'Title1SchoolWide', 'Students', 'Teachers', 'StudentTeacherRatio',
       'FreeLunch', 'ReducedLunch'],
      dtype='object')
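
To see what the pattern does, here is a short sketch on three of the original headers. The second half is an alternative using the string methods of a pandas Index; it is not what the notebook does, just another way to get the same names:

```python
import re
import pandas as pd

pattern = r'\*|\s+'   # a literal asterisk, or one or more whitespace characters
headers = ['Low Grade*', 'Title 1 School Wide*', 'ZIP 4-digit']

# same idea as above: a list comprehension with re.sub
cleanLoop = [re.sub(pattern, '', name) for name in headers]
print(cleanLoop)   # ['LowGrade', 'Title1SchoolWide', 'ZIP4-digit']

# alternative: let pandas apply the regular expression to every name
cleanIndex = pd.Index(headers).str.replace(pattern, '', regex=True)
print(cleanIndex.tolist() == cleanLoop)   # True
```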

Clean names make exploring easier. Notice we already took care of the missing-value symbols above, when reading the file with `na_values`. You could have done this instead:

In [119]:
symbolsForNA=['†','‡','–'] 

import numpy as np  #numpy manages the nan for pandas
schoolPub.replace(symbolsForNA,np.nan,inplace=True) # in the whole data frame!!

2. Formatting

In [120]:
schoolPub.dtypes

NCESSchoolID             int64
StateSchoolID            int64
NCESDistrictID           int64
StateDistrictID          int64
LowGrade                object
HighGrade               object
SchoolName              object
District                object
CountyName              object
StreetAddress           object
City                    object
State                   object
ZIP                      int64
ZIP4-digit             float64
Phone                    int64
LocaleCode             float64
Locale                  object
Charter                 object
Magnet                 float64
TitleISchool            object
Title1SchoolWide        object
Students               float64
Teachers               float64
StudentTeacherRatio    float64
FreeLunch              float64
ReducedLunch           float64
dtype: object

Even though we cleaned the missing values, more of them may be hidden in the text columns. Obviously, 'SchoolName','District','CountyName','StreetAddress','City','State' are text, but the others are possibly categorical.

So let me explore all the other ones, which are of type _object_:

In [121]:
notUsed=['SchoolName','District','CountyName','StreetAddress','City','State']
 
# These are the ones without the obvious text columns
schoolPub.drop(notUsed,axis=1).head()

Unnamed: 0,NCESSchoolID,StateSchoolID,NCESDistrictID,StateDistrictID,LowGrade,HighGrade,ZIP,ZIP4-digit,Phone,LocaleCode,Locale,Charter,Magnet,TitleISchool,Title1SchoolWide,Students,Teachers,StudentTeacherRatio,FreeLunch,ReducedLunch
0,530486002475,1656,5304860,31025,06,8,98271,,3606530665,22.0,Suburb: Midsize,No,,No,,167.0,7.3,22.9,20.0,7.0
1,530270001270,1646,5302700,6114,KG,12,98682,6308.0,3606046700,12.0,City: Midsize,No,,Yes,Yes,123.0,10.1,12.2,75.0,7.0
2,530910002602,4500,5309100,34033,09,12,98512,7427.0,3607097800,13.0,City: Small,No,,No,,867.0,41.19,21.0,189.0,45.0
3,530003000001,2834,5300030,14005,PK,6,98520,5510.0,3605382131,33.0,Town: Remote,No,,Yes,Yes,410.0,27.63,14.8,330.0,21.0
4,530825002361,1533,5308250,32081,09,12,99207,,5094587466,12.0,City: Midsize,No,,No,,22.0,3.1,7.1,16.0,3.0


In [122]:
# These are the ones without the obvious text columns, but of the type 'object':
schoolPub.drop(notUsed,axis=1).select_dtypes(include='object').head()

Unnamed: 0,LowGrade,HighGrade,Locale,Charter,TitleISchool,Title1SchoolWide
0,06,8,Suburb: Midsize,No,No,
1,KG,12,City: Midsize,No,Yes,Yes
2,09,12,City: Small,No,No,
3,PK,6,Town: Remote,No,Yes,Yes
4,09,12,City: Midsize,No,No,


We need to see the categories there:

In [123]:
schoolPub.drop(notUsed,axis=1).select_dtypes(include='object').apply(set).tolist()

[{'01',
  '02',
  '03',
  '04',
  '05',
  '06',
  '07',
  '08',
  '09',
  '10',
  '11',
  '12',
  'KG',
  'PK',
  nan},
 {'01',
  '02',
  '03',
  '04',
  '05',
  '06',
  '07',
  '08',
  '09',
  '10',
  '11',
  '12',
  '13',
  'KG',
  'PK',
  nan},
 {'City: Large',
  'City: Midsize',
  'City: Small',
  'N',
  'Rural: Distant',
  'Rural: Fringe',
  'Rural: Remote',
  'Suburb: Large',
  'Suburb: Midsize',
  'Suburb: Small',
  'Town: Distant',
  'Town: Fringe',
  'Town: Remote'},
 {'No', 'Yes'},
 {'No', 'Yes', nan},
 {'No', 'Yes', nan}]

We need to take care of the missing value '**N**':

In [124]:
schoolPub.Locale.value_counts(dropna=False)

Suburb: Large      580
City: Small        368
City: Midsize      246
Rural: Fringe      199
Suburb: Midsize    186
Rural: Distant     172
Town: Distant      157
Rural: Remote      139
City: Large        111
Town: Fringe        97
Town: Remote        85
Suburb: Small       57
N                    1
Name: Locale, dtype: int64

Then:

In [125]:
import numpy as np  #numpy manages the nan for pandas

schoolPub.replace(['N'],np.nan,inplace=True) # in the whole data frame!!

In [126]:
# So:
schoolPub.Locale.value_counts(dropna=False)

Suburb: Large      580
City: Small        368
City: Midsize      246
Rural: Fringe      199
Suburb: Midsize    186
Rural: Distant     172
Town: Distant      157
Rural: Remote      139
City: Large        111
Town: Fringe        97
Town: Remote        85
Suburb: Small       57
NaN                  1
Name: Locale, dtype: int64
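
Replacing in the whole data frame is convenient, but it changes every cell equal to 'N', not only the ones in _Locale_. The sketch below uses a made-up table (the _Initial_ column is hypothetical) to compare that with a replacement limited to one column:

```python
import numpy as np
import pandas as pd

# hypothetical table where 'N' is a missing code in one column and a valid value in another
toy = pd.DataFrame({'Locale': ['City: Small', 'N'],
                    'Initial': ['N', 'M']})

# in the whole data frame: both 'N' cells become missing
wholeFrame = toy.replace(['N'], np.nan)

# only in one column: the valid 'N' in Initial survives
oneColumn = toy.assign(Locale=toy['Locale'].replace('N', np.nan))

print(wholeFrame.isna().sum())
print(oneColumn.isna().sum())
```

In the school data this made no difference, since 'N' appeared only in _Locale_, but it is worth checking before replacing everywhere.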

Another important step could be to add some text so that the school grades have a recognizable ordering (considering the file will be read in R):

In [127]:
# this is wrong:
'PK'<'KG'<'01'

False

In [128]:
# this is OK:
'-1 PK'<'0 KG'<'01'

True

In [129]:
# using replace:

schoolPub.replace({'PK':"-1 PK", "KG":"0 KG"},inplace=True)
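
The effect of that recoding is easier to see by sorting a few grades before and after. This is a small sketch with a made-up list of grade labels:

```python
grades = ['01', 'KG', '12', 'PK', '09']

# text is sorted character by character, so the letters go after the digits:
print(sorted(grades))     # ['01', '09', '12', 'KG', 'PK']

# after adding a prefix, pre-kindergarten and kindergarten come first:
recode = {'PK': '-1 PK', 'KG': '0 KG'}
recoded = [recode.get(grade, grade) for grade in grades]
print(sorted(recoded))    # ['-1 PK', '0 KG', '01', '09', '12']
```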

Unless you want to recode other [variables](https://nces.ed.gov/programs/edge/docs/LOCALE_CLASSIFICATIONS.pdf), we could save this file:

In [130]:
schoolPub.to_csv("schoolPub.csv",index=False)

### Case: SNAP

In [131]:
import pandas as pd
dataFile="https://github.com/EvansDataScience/data/raw/master/cntysnap.xls"
snapBen=pd.read_excel(dataFile)

In [132]:
# first rows:
snapBen.head()

Unnamed: 0,Table with column headers in row 3,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,...,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22
0,Table: County SNAP benefits recipients\nSourc...,,,,,,,,,,...,,,,,,,,,,
1,State FIPS code,County FIPS code,Name,July 2013,July 2012,July 2011,July 2010,July 2009,July 2008,July 2007,...,July 2003,July 2002,July 2001,July 2000,July 1999,July 1998,July 1997,July 1995,July 1993,July 1989
2,1,0,Alabama,905604,914740,905658,859695,780427,642427,561024,...,491910,463998,435235,405325,396931,409160,434632,510271,561128,435296
3,1,1,"Autauga County, AL",8055,8079,8060,7593,6907,5423,4686,...,3594,3352,3035,2814,2700,2589,2771,3611,4608,3739
4,1,3,"Baldwin County, AL",24402,23169,22184,20604,16545,10848,8627,...,5659,4945,4129,4178,4372,4850,5201,6928,8053,5548


We need to skip some rows:

In [133]:
# skipping:

snapBen=pd.read_excel(dataFile,skiprows=2)
snapBen.head()

Unnamed: 0,State FIPS code,County FIPS code,Name,July 2013,July 2012,July 2011,July 2010,July 2009,July 2008,July 2007,...,July 2003,July 2002,July 2001,July 2000,July 1999,July 1998,July 1997,July 1995,July 1993,July 1989
0,1,0,Alabama,905604.0,914740.0,905658.0,859695.0,780427.0,642427.0,561024.0,...,491910.0,463998.0,435235.0,405325.0,396931.0,409160.0,434632.0,510271.0,561128.0,435296.0
1,1,1,"Autauga County, AL",8055.0,8079.0,8060.0,7593.0,6907.0,5423.0,4686.0,...,3594.0,3352.0,3035.0,2814.0,2700.0,2589.0,2771.0,3611.0,4608.0,3739.0
2,1,3,"Baldwin County, AL",24402.0,23169.0,22184.0,20604.0,16545.0,10848.0,8627.0,...,5659.0,4945.0,4129.0,4178.0,4372.0,4850.0,5201.0,6928.0,8053.0,5548.0
3,1,5,"Barbour County, AL",6740.0,6843.0,6465.0,6324.0,6340.0,5672.0,5389.0,...,4981.0,4959.0,4637.0,4415.0,4490.0,4788.0,4947.0,5268.0,5499.0,5056.0
4,1,7,"Bibb County, AL",4175.0,4269.0,4336.0,4152.0,3848.0,2967.0,2577.0,...,2116.0,1953.0,1765.0,1685.0,1631.0,1776.0,1999.0,2085.0,2503.0,2056.0


In [134]:
# check the tail
snapBen.tail()

Unnamed: 0,State FIPS code,County FIPS code,Name,July 2013,July 2012,July 2011,July 2010,July 2009,July 2008,July 2007,...,July 2003,July 2002,July 2001,July 2000,July 1999,July 1998,July 1997,July 1995,July 1993,July 1989
3196,56,37,"Sweetwater County, WY",2186.0,2073.0,1906.0,2000.0,1731.0,951.0,720.0,...,1435.0,1434.0,1152.0,1151.0,1108.0,1251.0,1511.0,2047.0,2031.0,1736.0
3197,56,39,"Teton County, WY",268.0,243.0,263.0,184.0,183.0,109.0,88.0,...,106.0,107.0,95.0,81.0,77.0,100.0,129.0,159.0,159.0,85.0
3198,56,41,"Uinta County, WY",1636.0,1743.0,1673.0,1747.0,1670.0,1059.0,1031.0,...,1271.0,1208.0,1094.0,1108.0,1106.0,1037.0,1058.0,1278.0,1267.0,856.0
3199,56,43,"Washakie County, WY",604.0,603.0,675.0,713.0,613.0,543.0,456.0,...,476.0,443.0,286.0,401.0,343.0,338.0,374.0,619.0,743.0,666.0
3200,56,45,"Weston County, WY",800.0,329.0,365.0,429.0,379.0,279.0,236.0,...,267.0,267.0,284.0,216.0,260.0,257.0,256.0,275.0,271.0,300.0


In [135]:
# checking names:
snapBen.columns

Index(['State FIPS code', 'County FIPS code', 'Name', 'July 2013', 'July 2012',
       'July 2011', 'July 2010', 'July 2009', 'July 2008', 'July 2007',
       'July 2006', 'July 2005', 'July 2004', 'July 2003', 'July 2002',
       'July 2001', 'July 2000', 'July 1999', 'July 1998', 'July 1997',
       'July 1995', 'July 1993', 'July 1989'],
      dtype='object')

In [136]:
# getting rid of blanks:

pattern='\\s+'
nothing=''
snapBen.columns=[re.sub(pattern,nothing,columnName) for columnName in snapBen.columns]

Some rows have a county FIPS code of zero; take a look:

In [137]:
snapBen[snapBen['CountyFIPScode']==0]

Unnamed: 0,StateFIPScode,CountyFIPScode,Name,July2013,July2012,July2011,July2010,July2009,July2008,July2007,...,July2003,July2002,July2001,July2000,July1999,July1998,July1997,July1995,July1993,July1989
0,1,0,Alabama,905604.0,914740.0,905658.0,859695.0,780427.0,642427.0,561024.0,...,491910.0,463998.0,435235.0,405325.0,396931.0,409160.0,434632.0,510271.0,561128.0,435296.0
68,2,0,Alaska,82561.0,84098.0,82354.0,76371.0,67591.0,57056.0,51754.0,...,43794.0,43863.0,38909.0,29473.0,33529.0,34026.0,35981.0,36339.0,47059.0,19569.0
102,4,0,Arizona,1058025.0,1116065.0,1123066.0,1049272.0,986413.0,752722.0,600549.0,...,522009.0,442945.0,355735.0,277436.0,258007.0,260747.0,311157.0,434543.0,492926.0,272507.0
118,5,0,Arkansas,496260.0,504549.0,498981.0,483581.0,456286.0,397887.0,376866.0,...,336668.0,303177.0,277277.0,251901.0,247206.0,254193.0,256490.0,272203.0,283001.0,220392.0
194,6,0,California,4289489.0,4124878.0,3903859.0,3574897.0,3106465.0,2528337.0,2155571.0,...,1807695.0,1689369.0,1699342.0,1690077.0,1882388.0,2071540.0,2368968.0,3166792.0,3013167.0,1780476.0
253,8,0,Colorado,505745.0,506082.0,484677.0,441346.0,387515.0,295668.0,249454.0,...,236350.0,198401.0,171983.0,152198.0,159713.0,176344.0,196140.0,245144.0,270835.0,206645.0
318,9,0,Connecticut,436118.0,419806.0,398405.0,371356.0,317717.0,244596.0,221603.0,...,192577.0,177267.0,164793.0,157935.0,167680.0,182236.0,200287.0,223945.0,217628.0,115384.0
327,10,0,Delaware,151017.0,152235.0,146157.0,130027.0,107036.0,85344.0,72404.0,...,53107.0,44427.0,37264.0,31121.0,34293.0,39855.0,47229.0,56829.0,58084.0,29441.0
331,11,0,District of Columbia,143182.0,144251.0,140450.0,131366.0,114055.0,98911.0,88504.0,...,85441.0,79029.0,73800.0,74711.0,82085.0,84001.0,86850.0,93358.0,86316.0,57983.0
333,12,0,Florida,3515731.0,3529190.0,3279197.0,2988351.0,2456494.0,1789890.0,1381870.0,...,1140247.0,1033097.0,963846.0,874989.0,892560.0,942226.0,1031032.0,1378539.0,1479935.0,684026.0


Those rows describe states, not counties. I will keep only the counties:

In [138]:
snapBenUSCounties=snapBen[snapBen['CountyFIPScode']!=0]

In [139]:
# checking data types:
snapBenUSCounties.dtypes

StateFIPScode       int64
CountyFIPScode      int64
Name               object
July2013          float64
July2012          float64
July2011          float64
July2010          float64
July2009          float64
July2008          float64
July2007          float64
July2006          float64
July2005          float64
July2004          float64
July2003          float64
July2002          float64
July2001          float64
July2000          float64
July1999          float64
July1998          float64
July1997          float64
July1995          float64
July1993          float64
July1989          float64
dtype: object

Each county name tells you which state it belongs to, so we can use that to create a new column. Let's see a simple example of how to get information from a text:

In [140]:
# using split, a function for strings:
'Autauga County, AL'.split(', ') # notice the space after the comma
# you get a list:

['Autauga County', 'AL']

The **split**, in this case, returns the state in the second position of the list (index=1), then:

In [141]:
# saving the second piece (the state) of each element in the column:
states=[element.split(', ')[1] for element in snapBenUSCounties.Name]

# make that list a new column
snapBenUSCounties=snapBenUSCounties.assign(StateName=states)

# checking:
snapBenUSCounties

Unnamed: 0,StateFIPScode,CountyFIPScode,Name,July2013,July2012,July2011,July2010,July2009,July2008,July2007,...,July2002,July2001,July2000,July1999,July1998,July1997,July1995,July1993,July1989,StateName
1,1,1,"Autauga County, AL",8055.0,8079.0,8060.0,7593.0,6907.0,5423.0,4686.0,...,3352.0,3035.0,2814.0,2700.0,2589.0,2771.0,3611.0,4608.0,3739.0,AL
2,1,3,"Baldwin County, AL",24402.0,23169.0,22184.0,20604.0,16545.0,10848.0,8627.0,...,4945.0,4129.0,4178.0,4372.0,4850.0,5201.0,6928.0,8053.0,5548.0,AL
3,1,5,"Barbour County, AL",6740.0,6843.0,6465.0,6324.0,6340.0,5672.0,5389.0,...,4959.0,4637.0,4415.0,4490.0,4788.0,4947.0,5268.0,5499.0,5056.0,AL
4,1,7,"Bibb County, AL",4175.0,4269.0,4336.0,4152.0,3848.0,2967.0,2577.0,...,1953.0,1765.0,1685.0,1631.0,1776.0,1999.0,2085.0,2503.0,2056.0,AL
5,1,9,"Blount County, AL",9037.0,9368.0,9425.0,8967.0,7283.0,5695.0,4690.0,...,3066.0,2637.0,2436.0,2593.0,2789.0,2926.0,3373.0,3665.0,2392.0,AL
6,1,11,"Bullock County, AL",2966.0,3023.0,3065.0,3039.0,2847.0,2553.0,2393.0,...,2242.0,2226.0,2124.0,2094.0,2124.0,2261.0,2666.0,2901.0,2564.0,AL
7,1,13,"Butler County, AL",5451.0,5719.0,5554.0,5402.0,5390.0,4378.0,3672.0,...,3342.0,3361.0,3407.0,3307.0,3235.0,3251.0,3748.0,4710.0,4221.0,AL
8,1,15,"Calhoun County, AL",25918.0,27006.0,26151.0,25252.0,22691.0,18382.0,15966.0,...,12235.0,11353.0,10339.0,10321.0,10910.0,11380.0,13225.0,14637.0,10161.0,AL
9,1,17,"Chambers County, AL",7624.0,7577.0,7942.0,8488.0,8263.0,6696.0,5740.0,...,4419.0,3952.0,3499.0,3659.0,3716.0,3756.0,4821.0,5153.0,3581.0,AL
10,1,19,"Cherokee County, AL",5434.0,5731.0,5671.0,5448.0,4937.0,4081.0,3645.0,...,2705.0,2540.0,2061.0,2011.0,2054.0,1971.0,1955.0,1892.0,1652.0,AL


The new column was created. We could get rid of the state information from the counties column:

In [142]:
# just keep county names
counties=[element.split(', ')[0] for element in snapBenUSCounties.Name]
snapBenUSCounties=snapBenUSCounties.assign(Name=counties)

In [143]:
# quick look:

snapBenUSCounties.head() # the new column will be at the end...

Unnamed: 0,StateFIPScode,CountyFIPScode,Name,July2013,July2012,July2011,July2010,July2009,July2008,July2007,...,July2002,July2001,July2000,July1999,July1998,July1997,July1995,July1993,July1989,StateName
1,1,1,Autauga County,8055.0,8079.0,8060.0,7593.0,6907.0,5423.0,4686.0,...,3352.0,3035.0,2814.0,2700.0,2589.0,2771.0,3611.0,4608.0,3739.0,AL
2,1,3,Baldwin County,24402.0,23169.0,22184.0,20604.0,16545.0,10848.0,8627.0,...,4945.0,4129.0,4178.0,4372.0,4850.0,5201.0,6928.0,8053.0,5548.0,AL
3,1,5,Barbour County,6740.0,6843.0,6465.0,6324.0,6340.0,5672.0,5389.0,...,4959.0,4637.0,4415.0,4490.0,4788.0,4947.0,5268.0,5499.0,5056.0,AL
4,1,7,Bibb County,4175.0,4269.0,4336.0,4152.0,3848.0,2967.0,2577.0,...,1953.0,1765.0,1685.0,1631.0,1776.0,1999.0,2085.0,2503.0,2056.0,AL
5,1,9,Blount County,9037.0,9368.0,9425.0,8967.0,7283.0,5695.0,4690.0,...,3066.0,2637.0,2436.0,2593.0,2789.0,2926.0,3373.0,3665.0,2392.0,AL
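
The two list comprehensions above can also be written with the string methods of pandas. This is a sketch of that alternative on a made-up two-row table, not the approach used in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Autauga County, AL', 'Teton County, WY']})

# expand=True returns a data frame with one column per piece
parts = toy['Name'].str.split(', ', expand=True)

# piece 0 is the county, piece 1 is the state
toy = toy.assign(Name=parts[0], StateName=parts[1])
print(toy)
```

Either way, the split assumes every name has exactly one ", " in it, which is worth verifying on the real data.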


We can have a better column order:

In [144]:
oldNames=snapBenUSCounties.columns.tolist()
oldNames

['StateFIPScode',
 'CountyFIPScode',
 'Name',
 'July2013',
 'July2012',
 'July2011',
 'July2010',
 'July2009',
 'July2008',
 'July2007',
 'July2006',
 'July2005',
 'July2004',
 'July2003',
 'July2002',
 'July2001',
 'July2000',
 'July1999',
 'July1998',
 'July1997',
 'July1995',
 'July1993',
 'July1989',
 'StateName']

In [145]:
newNames=[*oldNames[:2],oldNames[-1],*oldNames[2:-1]]
newNames          

['StateFIPScode',
 'CountyFIPScode',
 'StateName',
 'Name',
 'July2013',
 'July2012',
 'July2011',
 'July2010',
 'July2009',
 'July2008',
 'July2007',
 'July2006',
 'July2005',
 'July2004',
 'July2003',
 'July2002',
 'July2001',
 'July2000',
 'July1999',
 'July1998',
 'July1997',
 'July1995',
 'July1993',
 'July1989']

In [146]:
# reordering columns:

snapBenUSCounties=snapBenUSCounties[newNames]
snapBenUSCounties.head()

Unnamed: 0,StateFIPScode,CountyFIPScode,StateName,Name,July2013,July2012,July2011,July2010,July2009,July2008,...,July2003,July2002,July2001,July2000,July1999,July1998,July1997,July1995,July1993,July1989
1,1,1,AL,Autauga County,8055.0,8079.0,8060.0,7593.0,6907.0,5423.0,...,3594.0,3352.0,3035.0,2814.0,2700.0,2589.0,2771.0,3611.0,4608.0,3739.0
2,1,3,AL,Baldwin County,24402.0,23169.0,22184.0,20604.0,16545.0,10848.0,...,5659.0,4945.0,4129.0,4178.0,4372.0,4850.0,5201.0,6928.0,8053.0,5548.0
3,1,5,AL,Barbour County,6740.0,6843.0,6465.0,6324.0,6340.0,5672.0,...,4981.0,4959.0,4637.0,4415.0,4490.0,4788.0,4947.0,5268.0,5499.0,5056.0
4,1,7,AL,Bibb County,4175.0,4269.0,4336.0,4152.0,3848.0,2967.0,...,2116.0,1953.0,1765.0,1685.0,1631.0,1776.0,1999.0,2085.0,2503.0,2056.0
5,1,9,AL,Blount County,9037.0,9368.0,9425.0,8967.0,7283.0,5695.0,...,3527.0,3066.0,2637.0,2436.0,2593.0,2789.0,2926.0,3373.0,3665.0,2392.0


In [147]:
# JUST SAVING...
snapBenUSCounties.to_csv("snapBenUSCounties.csv",index=False)

### Case: Multiple data sets

In [148]:
corruptLink='https://raw.githubusercontent.com/EvansDataScience/data/master/corruption.csv'
econoLink='https://raw.githubusercontent.com/EvansDataScience/data/master/economic.csv'
enviroLink='https://raw.githubusercontent.com/EvansDataScience/data/master/environment.csv'
pressLink='https://raw.githubusercontent.com/EvansDataScience/data/master/pressfreedom.csv'

* The _corruptLink_ has data about the _Corruption Perception Index_ (CPI) produced by [Transparency International](https://www.transparency.org/).

* The _econoLink_ has data about the _Economic Freedom Index_ (EFI) produced by [Fraser Institute](https://www.fraserinstitute.org).

* The _enviroLink_ has data about the _Environment Performance Index_ (EPI) produced by [Yale University and Columbia University in collaboration with the World Economic Forum](https://epi.envirocenter.yale.edu/).

* The _pressLink_ has data about the _World Press Freedom Index_ (WPFI) produced by [Reporters Without Borders](https://rsf.org/en/world-press-freedom-index).


In this case, I want to join them (not concatenate):

In [149]:
import pandas as pd
corrupt=pd.read_csv(corruptLink,encoding='Latin-1')
econo=pd.read_csv(econoLink,encoding='Latin-1')
enviro=pd.read_csv(enviroLink,encoding='Latin-1')
press=pd.read_csv(pressLink,encoding='Latin-1')

As each data set has a different number of rows (countries), and possibly a different way to name each one, the result will be far from perfect:

In [150]:
join1=pd.merge(corrupt,econo)
join2=pd.merge(press,enviro)
indexes=pd.merge(join1,join2)
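
By default `pd.merge` keeps only the rows that match in both data sets, so countries spelled differently are silently dropped. A sketch of how you could find them, using two made-up tables (the country values and scores below are invented for illustration, not taken from the real files):

```python
import pandas as pd

# hypothetical data: the third country is spelled differently in each table
left = pd.DataFrame({'Country': ['Peru', 'Bolivia', 'Ivory Coast'],
                     'corruptionIndex': [35, 33, 34]})
right = pd.DataFrame({'Country': ['Peru', 'Bolivia', "Cote d'Ivoire"],
                      'scoreEconomy': [7.4, 6.2, 5.9]})

# default merge: only the countries present in both
inner = pd.merge(left, right)
print(len(inner))   # 2

# outer merge with indicator: the '_merge' column says where each row came from
check = pd.merge(left, right, how='outer', indicator=True)
unmatched = check[check['_merge'] != 'both']
print(unmatched[['Country', '_merge']])
```

The unmatched rows are the candidates for renaming before merging again.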

As always it is good to verify the data types:

In [151]:
indexes.dtypes

Country             object
corruptionIndex      int64
ISO                 object
scoreEconomy       float64
scorepress         float64
presscat            object
environment        float64
environmentCat       int64
dtype: object

And check descriptives:

In [152]:
indexes.describe(include='all') 

Unnamed: 0,Country,corruptionIndex,ISO,scoreEconomy,scorepress,presscat,environment,environmentCat
count,129,129.0,129,129.0,129.0,129,129.0,129.0
unique,129,,129,,,3,,
top,Albania,,CHN,,,Medium,,
freq,1,,1,,,71,,
mean,,45.418605,,6.829457,31.097597,,69.393023,0.542636
std,,19.296898,,0.907765,13.12522,,14.732355,0.500121
min,,14.0,,2.92,8.59,,37.1,0.0
25%,,31.0,,6.32,22.66,,59.25,0.0
50%,,40.0,,6.92,29.92,,70.84,1.0
75%,,58.0,,7.51,40.43,,81.26,1.0


In [153]:
indexes.head()

Unnamed: 0,Country,corruptionIndex,ISO,scoreEconomy,scorepress,presscat,environment,environmentCat
0,New Zealand,90,NZL,8.48,10.01,High,88.0,1
1,Denmark,90,DNK,7.77,8.89,High,89.21,1
2,Finland,89,FIN,7.75,8.59,High,90.68,1
3,Sweden,88,SWE,7.65,12.33,High,90.43,1
4,Switzerland,86,CHE,8.44,11.76,High,86.93,1


Some formatting is needed. Let's reorder the columns first:

In [154]:
oldCols=indexes.columns.tolist()
oldCols

['Country',
 'corruptionIndex',
 'ISO',
 'scoreEconomy',
 'scorepress',
 'presscat',
 'environment',
 'environmentCat']

When the columns we need are not next to each other, slices do not help, so there is extra work:

In [155]:
numericIndex=[oldCols[i] for i in [1,3,4,6]]
numericIndex

['corruptionIndex', 'scoreEconomy', 'scorepress', 'environment']

In [156]:
newValues=[oldCols[0],oldCols[2],*numericIndex,oldCols[5],oldCols[7]]
newValues

['Country',
 'ISO',
 'corruptionIndex',
 'scoreEconomy',
 'scorepress',
 'environment',
 'presscat',
 'environmentCat']

Then, the new order will be:

In [157]:
indexes=indexes[newValues]
indexes.head()

Unnamed: 0,Country,ISO,corruptionIndex,scoreEconomy,scorepress,environment,presscat,environmentCat
0,New Zealand,NZL,90,8.48,10.01,88.0,High,1
1,Denmark,DNK,90,7.77,8.89,89.21,High,1
2,Finland,FIN,89,7.75,8.59,90.68,High,1
3,Sweden,SWE,88,7.65,12.33,90.43,High,1
4,Switzerland,CHE,86,8.44,11.76,86.93,High,1


There are several numeric values. Let's see a summary:

In [158]:
indexes.describe()

Unnamed: 0,corruptionIndex,scoreEconomy,scorepress,environment,environmentCat
count,129.0,129.0,129.0,129.0,129.0
mean,45.418605,6.829457,31.097597,69.393023,0.542636
std,19.296898,0.907765,13.12522,14.732355,0.500121
min,14.0,2.92,8.59,37.1,0.0
25%,31.0,6.32,22.66,59.25,0.0
50%,40.0,6.92,29.92,70.84,1.0
75%,58.0,7.51,40.43,81.26,1.0
max,90.0,8.81,80.96,90.68,1.0


It is important to check for monotony issues in these values, that is, whether all the indexes increase in the same direction:

In [160]:
%matplotlib inline
import matplotlib.pyplot as plt
pd.plotting.scatter_matrix(indexes.iloc[:,2:6])
plt.show()
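
A correlation matrix can complement the scatter plots with a numeric check of direction. This is a minimal sketch on hypothetical toy values (the data frame `toyScores` and its column names are invented for illustration, they are not the course data):

```python
import pandas as pd

# Hypothetical toy scores: 'press' runs opposite to the other two
toyScores = pd.DataFrame({'corruption': [90, 70, 50, 30, 10],
                          'economy': [8.5, 7.9, 7.0, 6.1, 5.0],
                          'press': [10.0, 20.0, 35.0, 50.0, 80.0]})

# pairwise Pearson correlations
corrMatrix = toyScores.corr()

# columns negatively correlated with the first column
negative = corrMatrix.index[corrMatrix['corruption'] < 0].tolist()
print(negative)
```

A column that appears in `negative` is a candidate for reversing.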


_scorepress_ is negatively correlated with the rest, so it runs in the opposite direction. That means the values of that column need to be reversed:

In [161]:
# creating reversing function:
def reverse(aColumn):
    return max(aColumn) - aColumn + min(aColumn)

In [162]:
# reversing using function:
indexes.scorepress=reverse(indexes.scorepress)
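
To see what the reversing function does, here is a small sketch on a hypothetical Series (the values in `toy` are made up). The minimum and maximum stay the same, while the order of the values is flipped:

```python
import pandas as pd

# same function as above
def reverse(aColumn):
    return max(aColumn) - aColumn + min(aColumn)

# hypothetical values, in ascending order
toy = pd.Series([10.0, 20.0, 35.0, 80.0])
reversedToy = reverse(toy)

print(reversedToy.tolist())  # [80.0, 70.0, 55.0, 10.0]
```

Because the transformation is linear with slope -1, the correlation between the original and the reversed values is -1.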

We should see a different result:

In [163]:
import matplotlib.pyplot as plt  # plotting interface used by plt.show()
pd.plotting.scatter_matrix(indexes.iloc[:,2:6])
plt.show()

The variable _presscat_ needs to be an ordinal factor.

In [164]:
indexes['presscat'].value_counts()

Medium    71
High      35
Low       23
Name: presscat, dtype: int64

In [165]:
# recode the labels as ascending integers and assign the result back to the column
indexes['presscat']=indexes['presscat'].map({'Low':1, 'Medium':2, 'High':3})

In [166]:
indexes['presscat'].value_counts(sort=False)

1    23
2    71
3    35
Name: presscat, dtype: int64

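Coding the categories as ascending numbers will help R users when they declare the variable as an ordinal. You could instead convert the column to an ordered categorical in pandas, but that information will be lost when the data is read in R.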
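
The sketch below illustrates that point on hypothetical toy data (the data frame `toy` and the column names `presscatOrd` and `presscatNum` are made up). It assumes the data travels as a CSV file, which is simulated here with an in-memory buffer: the ordered categorical comes back as plain text, while the numeric codes keep the order.

```python
import io
import pandas as pd

# Hypothetical toy column with the same three labels
toy = pd.DataFrame({'presscat': ['Medium', 'High', 'Low', 'Medium']})

# ordered categorical: Low < Medium < High
levels = ['Low', 'Medium', 'High']
toy['presscatOrd'] = pd.Categorical(toy['presscat'],
                                    categories=levels, ordered=True)
print(toy['presscatOrd'].min(), toy['presscatOrd'].max())  # Low High

# numeric codes starting at 1, in the same ascending order
toy['presscatNum'] = toy['presscatOrd'].cat.codes + 1

# round trip through CSV (in memory, no file is written)
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
back = pd.read_csv(buffer)

# the categorical column is no longer categorical after reading it back
print(back.dtypes)
```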

In [167]:
indexes.head()

Unnamed: 0,Country,ISO,corruptionIndex,scoreEconomy,scorepress,environment,presscat,environmentCat
0,New Zealand,NZL,90,8.48,79.54,88.0,3,1
1,Denmark,DNK,90,7.77,80.66,89.21,3,1
2,Finland,FIN,89,7.75,80.96,90.68,3,1
3,Sweden,SWE,88,7.65,77.22,90.43,3,1
4,Switzerland,CHE,86,8.44,77.79,86.93,3,1


We are proposing that categories coded as numbers follow an ascending order, so let's check whether _environmentCat_ should be changed:

In [168]:
indexes['environmentCat'].value_counts()

1    70
0    59
Name: environmentCat, dtype: int64

As there is no need for that, just save the file:

In [None]:
# indexes.to_csv("indexes.csv",index=False)


____

* [Go to page beginning](#beginning)
* [Go to REPO in Github](https://github.com/EvansDataScience/ComputationalThinking_Gov_2)
* [Go to Course schedule](https://evansdatascience.github.io/GovernanceAnalytics/)