## Andrea Calef

    💁‍♂️ My details: a.calef@uea.ac.uk; office hours Wednesdays 4-6pm on Teams. Or meet live by appointment!

The materials for this week are available as a Jupyter notebook. Jupyter notebooks mix rich text with runnable Python code, so you can follow along with this lecture, run the Python examples, and even add your own notes and code. To do this, go to

https://mybinder.org/v2/gh/tturocy/eco7026a/HEAD

Alternatively, you can copy and paste code from here into the Python command line or an IDE such as Spyder.

### Using Jupyter notebooks

To get your own copy of this notebook, choose **File** above then **Download**.

When you have done that, click in the field below, and either press the play button or type Shift+Enter. This executes the Python cell.

## Notes: necessary libraries to replicate this lecture.

pip list 
<br>
#print(pd.__version__)

import numpy as np
<br>
import pandas as pd
<br>
import os

#pip install pandas_datareader

#pip install pyarrow 
<br>
import pyarrow as pa
<br>
import pyarrow.parquet as pq

Following PEP 8 and common Python practice, all imports should be placed at the beginning of your .py/.ipynb file. 

In this lecture we will load a couple of .csv files and, at the end, save part of our work to files with different extensions. Before starting any work, it is worth checking the current working directory ...

In [None]:
pwd 

Or after importing **os** ...

In [None]:
#import os
os.getcwd()

And modify the directory as needed with: 
<br>
cd C:\Your\Directory\Until\the\chosen\folder\
<br>
os.chdir('C:/Your/Directory/Until/the/chosen/folder/')
<br>
Let us try below!
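
A minimal sketch of the same idea in plain Python. **pwd** and **cd** are IPython/Jupyter shortcuts, whereas **os.getcwd()** and **os.chdir()** also work in an ordinary .py script. The temporary folder below is only a stand-in (an assumption for this demo) for your own data folder.

```python
import os
import tempfile

start_dir = os.getcwd()                 # remember where we started

# A throwaway folder stands in for 'C:/Your/Directory/...' (demo assumption).
with tempfile.TemporaryDirectory() as demo_dir:
    os.chdir(demo_dir)                  # move into the chosen folder
    inside_dir = os.getcwd()
    print(inside_dir)
    os.chdir(start_dir)                 # move back before the folder is deleted

print(os.getcwd() == start_dir)
```

Forward slashes in the path work on Windows too, which avoids problems with backslashes inside Python strings.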

Today we are going to look at **pandas** and **numpy**, which are libraries for data analysis and numerical programming, respectively.
<br>
<br>
**numpy** and **pandas** provide some useful data types of their own, which considerably speed up the process of data analysis.
* numpy provides mathematical functions and multidimensional arrays. 
* pandas builds on numpy and helps us handle data for analysis.

## numpy

The main functionality of **numpy** is to help process **arrays**. 
<br>
<br>
Arrays provide a way to store data. In the previous week, we looked at lists. Arrays take that concept and make it multi-dimensional; a two-dimensional array plays the same role as a **matrix** in Mathematics.
<br>
<br>
**numpy** is a library that contains many functions. To call a particular one, write the library name (or its alias), a dot (.), and the function name, as in **np.array**. 
It is worth taking a look at the **documentation** of this library by clicking on the following <a href='https://numpy.org/doc/stable/numpy-user.pdf'>hyperlink</a>.

In [None]:
#import numpy as np 
a = np.array([[1,2],[3,4]])
print(a)
print(a*3)
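
To see why an array is more than a nested list, here is a small sketch comparing the two. The variable names are chosen just for this illustration.

```python
import numpy as np

nested_list = [[1, 2], [3, 4]]
array_2d = np.array(nested_list)

print(nested_list * 3)   # a list is repeated: the outer list now has 6 rows
print(array_2d * 3)      # an array is multiplied element by element
print(array_2d.shape)    # (number of rows, number of columns)
```

The same operator behaves differently because numpy arrays are designed for arithmetic, while lists are general-purpose containers.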


In [None]:
m = np.array([[2,1],[1,2]])
eigen_value, eigen_vector = np.linalg.eig(m)
print(eigen_value)
print(eigen_vector)

Let us import the submodule that does linear algebra. Note that the import must use the library's real name (numpy), not the alias np.

In [None]:
from numpy import linalg as la

We perform some matrix operations now. For a refresher, please click <a href='https://twister.caps.ou.edu/OBAN2019/Intro_FEM_files/IFEM.AppC.pdf'>here</a>. 

In [None]:
from numpy import linalg as la 
eigen_value_, eigen_vector_ = la.eig(m)
print(eigen_value_)
print(eigen_vector_)
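
A quick way to convince ourselves that the output is right is to check the defining property of an eigenvector, m v = λ v. The sketch below does this for each eigenvalue, assuming the same matrix m as above.

```python
import numpy as np

m = np.array([[2, 1], [1, 2]])
eigen_value, eigen_vector = np.linalg.eig(m)

# Each COLUMN of eigen_vector pairs with the eigenvalue in the same position.
for i in range(len(eigen_value)):
    v = eigen_vector[:, i]
    print(np.allclose(m @ v, eigen_value[i] * v))
```

**np.allclose** is used instead of == because floating-point results are rarely exactly equal.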

In [None]:
print(la.det(a))
det = la.det(a)
print(np.round(det))
b = la.inv(a)
print(b)
print(np.matmul(b,a))
print(np.round(np.matmul(b,a)))
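
The rounding in the cell above hides small floating-point errors. As an alternative sketch, we can compare the product with the identity matrix up to a tolerance instead of rounding it.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.linalg.inv(a)

product = b @ a                                # same as np.matmul(b, a)
print(np.allclose(product, np.eye(2)))         # identity matrix, up to round-off
print(np.isclose(np.linalg.det(a), -2.0))      # 1*4 - 2*3 = -2, up to round-off
```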

## pandas

**pandas** is a library that builds on the mathematical tools introduced by numpy to provide a comprehensive set of statistics tools, in a similar way to software like R or STATA.
<br>
<br>
We can use a public dataset as a starting point for producing some real statistics.
<br>
<br>
For example, let us go on the <a href='https://fred.stlouisfed.org/'>**FRED**</a> – Federal Reserve Economic Data's website, which provides some useful macroeconomic variables we can use for analysis.
<br>
<br>
We will have a look at the UK unemployment rate. 
<br>
<br>
The FRED website can supply us with a CSV file of the relevant variable, which we can work with if we put it in the folder in which Python is running. (We can also use Python’s **requests** module to get the file.)

### pandas: Data structures

Before loading our data, we need to learn about the data types available from the pandas library. 
<br>
<br>
The main pandas data types (dtypes) are bool, datetime64, the 64-bit numeric types (int64 and float64), and object. 
<br>
<br>
The two main pandas data structures are the Series and the DataFrame. 
<br>
<br>
A **Series** is like a column of data, whereas a DataFrame is a collection of these Series. 
<br>
<br>
The distinction is similar to that between a column of a spreadsheet, and the whole spreadsheet itself. 
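
The sketch below builds a small DataFrame out of three Series to make the distinction concrete. All numbers, dates, and names are made up for illustration.

```python
import pandas as pd

years = [2008, 2009, 2010]
rate = pd.Series([1.5, 2.5, 3.5], index=years, name='rate')
flag = pd.Series([True, True, False], index=years, name='flag')
when = pd.Series(pd.to_datetime(['2009-01-20', '2010-01-19', '2011-01-18']),
                 index=years, name='when')

# A DataFrame is a collection of Series that share one index.
table = pd.concat([rate, flag, when], axis=1)

print(table.dtypes)           # one dtype per column
print(type(table['rate']))    # selecting a single column gives back a Series
```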

In [None]:
#import pandas as pd
s = pd.Series([3,2,3,4], name='inflation')
print(s)
s = pd.Series([3,2,3,4], name='inflation', index = [2008, 2009, 2010, 2011])
print(s[2008])
print(s.loc[2008:2010])
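
Note that the label-based slice above includes both end labels, unlike ordinary Python slicing. A short sketch of the difference, using the same Series:

```python
import pandas as pd

s = pd.Series([3, 2, 3, 4], name='inflation', index=[2008, 2009, 2010, 2011])

print(s.loc[2008:2010])   # by label: 2008, 2009 AND 2010
print(s.iloc[0:2])        # by position: positions 0 and 1 only
print(s.to_numpy())       # the underlying numpy array, without labels
```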

The Series object is much like a numpy array, but it also lets us label our data with an index <u>and</u> offers some descriptive statistics features.

In [None]:
print(s.describe())
print(s.count())
print(s.mean())
print(s.median())
print(s.std())
print(s.min())
print(s.quantile(0.25))
print(s.max())

print(s.mode())
print(s.skew())
print(s.kurt())

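If the file is stored online on GitHub, read it directly from its URL, taking care to use the function that matches its extension (for example, **pd.read_csv** for a .csv file). 
<br>
If the file is stored on your computer, check the working directory first, and change it if needed:
<br>
import os 
<br>
os.getcwd()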
<br>
#os.chdir('C:/Users/andre/Dropbox/University/Teaching/UEA/Module organiser/ECO-7026A Programming and Data Analytics for Behavioural Economists/Lecture 02')


In [51]:
url = 'https://raw.githubusercontent.com/tturocy/eco7026a/main/week2/LRHUTTTTGBM156S.csv'
df = pd.read_csv(url)

In [None]:
print(df)

In [52]:
usa_data = pd.read_csv('https://raw.githubusercontent.com/tturocy/eco7026a/main/week2/LRHUTTTTUSM156S.csv')

Let us take a look at the first four observations ...

In [None]:
usa_data.head(4)

... and the last four. 

In [None]:
usa_data.tail(4)

In [None]:
world_data = df.merge(usa_data,sort=True) # The merge considers just the dates in common. 
world_data.head(8) 
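
To see what the default merge does when the two data sets cover different dates, here is a toy sketch. The numbers and the column names uk_rate and usa_rate are made up for illustration; **indicator=True** adds a column saying where each row came from.

```python
import pandas as pd

# Toy data (made-up numbers, hypothetical column names).
uk = pd.DataFrame({'DATE': ['2000-02-01', '2000-03-01', '2000-04-01'],
                   'uk_rate': [5.8, 5.7, 5.6]})
usa = pd.DataFrame({'DATE': ['2000-01-01', '2000-02-01', '2000-03-01'],
                    'usa_rate': [4.0, 4.1, 4.0]})

inner = uk.merge(usa)                                # default how='inner', joined on the shared column DATE
outer = uk.merge(usa, how='outer', indicator=True)   # keeps every date; gaps become NaN

print(inner)
print(outer)
```

The inner merge keeps only the two dates present in both tables, while the outer merge keeps all four and fills the gaps with NaN.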

In [None]:
world_data1 = df.merge(usa_data,how='outer') # The merge considers the longest time period and generates NaN for the variables with shorter time periods. 
world_data1.head(8) 

We notice that the data are not well sorted. Why? What can we do to correct this?

In [None]:
world_data1 = world_data1.sort_values(by=['DATE']) # This line sorts the world_data1 DataFrame by the DATE column. 
world_data1.head(8) 

In [None]:
world_data1.index = range(len(world_data1.index))
world_data1.head(8)

Alternative solution 

world_data1.reset_index(drop=True, inplace = True) # This command resets the index. The change is permanent with the option inplace=True. 
world_data1.head(8) 
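
The two steps, sorting and then resetting the index, can also be chained. A minimal sketch on made-up data:

```python
import pandas as pd

unsorted = pd.DataFrame({'DATE': ['2000-03-01', '2000-01-01', '2000-02-01'],
                         'rate': [4.0, 4.2, 4.1]})   # made-up values

# Sort by date, then replace the old row labels with 0, 1, 2, ...
tidy = unsorted.sort_values(by='DATE').reset_index(drop=True)
print(tidy)
```

Dates written as year-month-day text sort correctly even as plain strings, which is why this works before any conversion to a datetime type.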

There are two other solutions; one is written below. Look again above to find the other one ...

In [53]:
world_data1 = df.merge(usa_data,how='outer', sort = True) # This is ok, only because data sets were already sorted by Date.
world_data1.head(8) 

Unnamed: 0,DATE,LRHUTTTTGBM156S,LRHUTTTTUSM156S
0,1960-01-01,,5.2
1,1960-02-01,,4.8
2,1960-03-01,,5.4
3,1960-04-01,,5.2
4,1960-05-01,,5.1
5,1960-06-01,,5.4
6,1960-07-01,,5.5
7,1960-08-01,,5.6


Let us rename the two main variables of our data set. 

In [None]:
world_data.rename(columns={'LRHUTTTTGBM156S': 'uk_unemployment', 'LRHUTTTTUSM156S': 'usa_unemployment'})

Do you notice anything above? 
<br>
Let us see all the data below.

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None): print(world_data)

Nice, but ...

In [None]:
world_data.rename(columns={'LRHUTTTTGBM156S': 'uk_unemployment', 'LRHUTTTTUSM156S': 'usa_unemployment'}, inplace = True)

In [None]:
print(world_data['usa_unemployment'])


Without the option "inplace = True", **rename** returns a renamed copy and leaves the original DataFrame unchanged, so the new column names are lost unless you assign the result to a variable.
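
A small sketch of this behaviour on a one-column toy DataFrame (made-up values), showing assignment as an alternative to inplace:

```python
import pandas as pd

toy = pd.DataFrame({'LRHUTTTTGBM156S': [5.8, 5.7]})   # made-up values

toy.rename(columns={'LRHUTTTTGBM156S': 'uk_unemployment'})   # result is discarded
columns_before = list(toy.columns)
print(columns_before)                                        # still the old name

toy = toy.rename(columns={'LRHUTTTTGBM156S': 'uk_unemployment'})   # keep the result
print(list(toy.columns))
```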

**DataFrame.assign()** returns a new DataFrame with one or more columns added. The new columns can be defined by constants, operations on existing columns, and/or functions.

In [54]:
world_data = world_data.assign(Source = 'FRED', Diff_U = world_data.uk_unemployment - world_data.usa_unemployment) 
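
The cell above uses a constant and an operation. As a sketch of the third possibility, a function, the toy example below adds a column computed by a lambda. The data and column names are made up.

```python
import pandas as pd

toy = pd.DataFrame({'uk': [5.8, 5.7], 'usa': [4.0, 4.1]})   # made-up values

# A constant, an operation on existing columns, and a function of the DataFrame.
toy = toy.assign(Source='demo',
                 diff=toy.uk - toy.usa,
                 uk_above_usa=lambda d: d['diff'] > 0)
print(toy)
```

The lambda receives the DataFrame including the columns created earlier in the same call, so it can use diff. Square brackets are needed there because d.diff would refer to the DataFrame method of that name.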

Let us better understand the data set we are using before undertaking any analysis.

In [None]:
world_data.dtypes 

In [None]:
world_data.shape 

In [None]:
world_data.info()

In [None]:
world_data1.info()

In [None]:
world_data1.isna()

Does this remind you of anything from the last lecture?

In [None]:
world_data1.isna().mean()*100
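
Why does the mean give a percentage? Because True counts as 1 and False as 0, the mean of a Boolean column is the share of True values. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({'uk': [None, None, 5.8, 5.7],
                    'usa': [4.0, 4.1, 4.0, 3.9]})   # made-up values

missing = toy.isna()            # True where a value is missing
print(missing.sum())            # number of missing values per column
print(missing.mean() * 100)     # percentage of missing values per column
```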

In [None]:
world_data.nunique()

The information given by **DataFrame.dtypes** is a subset of what **DataFrame.info()** reports.

In [55]:
import datetime
world_data['date'] = pd.to_datetime(world_data['DATE'], yearfirst = True, format='%Y-%m-%d') # it creates a datetime64[ns] variable. 
world_data['day'] = world_data['date'].dt.day
world_data['month'] = world_data['date'].dt.month
world_data['year'] = world_data['date'].dt.year
world_data['weekday'] = world_data['date'].dt.dayofweek
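
To read the weekday column correctly, remember that pandas counts Monday as 0 and Sunday as 6. A short sketch with two of the dates from our data set:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['1983-01-01', '1990-01-01'], format='%Y-%m-%d'))

print(dates.dt.year.tolist())
print(dates.dt.dayofweek.tolist())    # Monday is 0, Sunday is 6
print(dates.dt.day_name().tolist())   # the same information in words
```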


In [56]:
world_data.dtypes

DATE                        object
uk_unemployment            float64
usa_unemployment           float64
Source                      object
Diff_U                     float64
date                datetime64[ns]
day                          int64
month                        int64
year                         int64
weekday                      int64
dtype: object

In [None]:
world_data[['DATE','weekday']] # note the double square brackets when you select more than one column. 


In [62]:
world_data = world_data.set_index(['DATE'])
world_data

DATE,uk_unemployment,usa_unemployment,Source,Diff_U,date,day,month,year,weekday
1983-01-01,10.7,10.4,FRED,0.3,1983-01-01,1,1,1983,5
1983-02-01,10.8,10.4,FRED,0.4,1983-02-01,1,2,1983,1
1983-03-01,10.8,10.3,FRED,0.5,1983-03-01,1,3,1983,1
1983-04-01,11.0,10.2,FRED,0.8,1983-04-01,1,4,1983,4
1983-05-01,10.9,10.1,FRED,0.8,1983-05-01,1,5,1983,6
...,...,...,...,...,...,...,...,...,...
2019-11-01,3.7,3.5,FRED,0.2,2019-11-01,1,11,2019,4
2019-12-01,3.7,3.5,FRED,0.2,2019-12-01,1,12,2019,6
2020-01-01,3.9,3.6,FRED,0.3,2020-01-01,1,1,2020,2
2020-02-01,3.9,3.5,FRED,0.4,2020-02-01,1,2,2020,5


In [63]:
nineties_data = world_data.loc['1990-01-01':'1999-12-01']
nineties_data

DATE,uk_unemployment,usa_unemployment,Source,Diff_U,date,day,month,year,weekday
1990-01-01,6.8,5.4,FRED,1.4,1990-01-01,1,1,1990,0
1990-02-01,6.8,5.3,FRED,1.5,1990-02-01,1,2,1990,3
1990-03-01,6.7,5.2,FRED,1.5,1990-03-01,1,3,1990,3
1990-04-01,6.8,5.4,FRED,1.4,1990-04-01,1,4,1990,6
1990-05-01,6.7,5.4,FRED,1.3,1990-05-01,1,5,1990,1
...,...,...,...,...,...,...,...,...,...
1999-08-01,5.9,4.2,FRED,1.7,1999-08-01,1,8,1999,6
1999-09-01,5.7,4.2,FRED,1.5,1999-09-01,1,9,1999,2
1999-10-01,5.8,4.1,FRED,1.7,1999-10-01,1,10,1999,4
1999-11-01,5.7,4.1,FRED,1.6,1999-11-01,1,11,1999,0


In [None]:
nineties_data_reduced = world_data.loc['1990-01-01':'1999-12-01', world_data.columns != 'day']
nineties_data_reduced

In [None]:
nineties_data_reduced = world_data.loc['1990-01-01':'1999-12-01',['date','uk_unemployment','usa_unemployment','weekday']]
nineties_data_reduced
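
The cells above select rows and columns at the same time with **.loc[rows, columns]**. The sketch below repeats the three patterns on a four-row toy table that reuses the first 1990 values shown earlier; the short column names uk and usa are chosen just for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({'uk': [6.8, 6.8, 6.7, 6.8],
                    'usa': [5.4, 5.3, 5.2, 5.4],
                    'day': [1, 1, 1, 1]},
                   index=['1990-01-01', '1990-02-01', '1990-03-01', '1990-04-01'])

rows = toy.loc['1990-02-01':'1990-03-01']                           # both end labels are included
no_day = toy.loc['1990-02-01':'1990-03-01', toy.columns != 'day']   # Boolean mask over the columns
chosen = toy.loc['1990-02-01':'1990-03-01', ['usa', 'uk']]          # explicit list, in the order given

print(rows)
print(no_day)
print(chosen)
```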

In [None]:
world_data.rename(columns={'LRHUTTTTGBM156S': 'uk_unemployment', 'LRHUTTTTUSM156S': 'usa_unemployment'}, inplace = True)
nineties_data_reduced = world_data.loc['1990-01-01':'1999-12-01',['date','uk_unemployment','usa_unemployment','weekday']]
nineties_data_reduced

In [None]:
nineties_data['uk_unemployment'].mean()
nineties_data['usa_unemployment'].mean()
nineties_data[['uk_unemployment','usa_unemployment']].mean()
nineties_data.describe(exclude=['int64','datetime64[ns]'])


In [None]:
nineties_data = world_data.loc['1990-01-01':'1999-12-01']
nineties_data['uk_unemployment'].mean()


In [None]:
nineties_data['usa_unemployment'].mean()


In [None]:
nineties_data[['uk_unemployment','usa_unemployment']].mean()


In [None]:
nineties_data.describe(exclude=['int64','datetime64[ns]'])


In [64]:
from pandas_datareader import wb
matches = wb.search('government.*debt.*gdp')
matches


Unnamed: 0,id,name,unit,source,sourceNote,sourceOrganization,topics
6804,GB.DOD.TOTL.GD.ZS,"Central government debt, total (% of GDP)",,WDI Database Archives,,b'',
6805,GB.DOD.TOTL.GDP.ZS,"Central government debt, total (% of GDP)",,WDI Database Archives,,b'',
6910,GC.DOD.TOTL.GD.ZS,"Central government debt, total (% of GDP)",,World Development Indicators,Debt is the entire stock of direct government ...,"b'International Monetary Fund, Government Fina...",Economy & Growth ; Public Sector


In [None]:
debt = wb.download(indicator='GC.DOD.TOTL.GD.ZS', country="all", start=2005, end=2016)
debt

In [65]:
debt = wb.download(indicator='GC.DOD.TOTL.GD.ZS', country="all", start=2005, end=2016).stack().unstack(0)
debt

year,indicator,Caribbean small states,Europe & Central Asia (excluding high income),Europe & Central Asia (IDA & IBRD countries),High income,North America,OECD members,Post-demographic dividend,South Asia,South Asia (IDA & IBRD),Albania,...,Tonga,Trinidad and Tobago,Tunisia,Turkiye,Ukraine,United Arab Emirates,United Kingdom,United States,Uruguay,Zambia
2005,GC.DOD.TOTL.GD.ZS,,,,67.514647,55.64122,67.552922,67.942422,63.096878,63.096878,,...,,,52.42053,,,,90.446724,56.538848,76.11446,
2006,GC.DOD.TOTL.GD.ZS,48.527819,,,65.836828,54.424774,66.023954,66.338816,60.560896,60.560896,,...,,16.803207,48.56391,,,,90.238447,55.467336,69.026896,
2007,GC.DOD.TOTL.GD.ZS,47.400555,,,65.672959,54.249387,65.911099,66.186222,58.281438,58.281438,,...,,15.928004,45.78247,,,,91.710822,55.659926,58.512061,
2008,GC.DOD.TOTL.GD.ZS,,18.00112,18.597792,72.684522,62.118862,71.96995,72.916878,57.099081,57.099081,,...,,,43.287705,40.889175,13.229954,,103.407032,63.81513,57.252683,
2009,GC.DOD.TOTL.GD.ZS,,22.389435,23.425111,85.034118,73.827948,84.187276,85.253423,56.356526,56.356526,,...,,,42.929792,48.471026,23.990034,,121.319808,75.842048,49.273329,
2010,GC.DOD.TOTL.GD.ZS,,20.993859,22.580092,91.046474,82.189046,90.037092,91.205898,52.116468,52.116468,,...,,,38.766,45.145315,28.866637,,129.072972,84.964411,44.222702,17.321511
2011,GC.DOD.TOTL.GD.ZS,,20.094904,22.101109,97.078882,86.442315,95.685822,97.08392,52.273975,52.273975,69.637674,...,,,42.480722,40.076954,26.48082,,141.584861,89.546817,45.351451,18.049909
2012,GC.DOD.TOTL.GD.ZS,,19.469888,21.777398,100.082318,90.343187,98.614801,99.96006,51.349863,51.349863,63.669153,...,,,42.519754,37.960229,32.445912,,144.97486,93.649262,43.40345,23.488919
2013,GC.DOD.TOTL.GD.ZS,,17.987575,20.65863,100.30789,91.804397,99.75125,101.221339,51.093928,51.093928,70.58077,...,49.038084,,,32.121645,35.6314,1.893523,139.55503,95.534688,42.740851,24.223994
2014,GC.DOD.TOTL.GD.ZS,,20.345787,23.083376,104.176166,91.887488,102.339488,104.24872,50.712231,50.712231,73.320227,...,47.476485,,,31.346417,63.665319,,148.494533,95.766699,44.357961,44.395868


In [None]:
debt1 = wb.download(indicator='GC.DOD.TOTL.GD.ZS', country="all", start=2005, end=2016).stack(dropna=False).unstack(0)
debt1

In [None]:
debt2 = wb.download(indicator='GC.DOD.TOTL.GD.ZS', country="all", start=2005, end=2016).unstack(0)
debt2

In [None]:
del debt1, debt2 # remove the two intermediate DataFrames from memory

In [None]:
debt

In [66]:
debt.index = debt.index.droplevel(1) # drop indicator index 
debt

year,Caribbean small states,Europe & Central Asia (excluding high income),Europe & Central Asia (IDA & IBRD countries),High income,North America,OECD members,Post-demographic dividend,South Asia,South Asia (IDA & IBRD),Albania,...,Tonga,Trinidad and Tobago,Tunisia,Turkiye,Ukraine,United Arab Emirates,United Kingdom,United States,Uruguay,Zambia
2005,,,,67.514647,55.64122,67.552922,67.942422,63.096878,63.096878,,...,,,52.42053,,,,90.446724,56.538848,76.11446,
2006,48.527819,,,65.836828,54.424774,66.023954,66.338816,60.560896,60.560896,,...,,16.803207,48.56391,,,,90.238447,55.467336,69.026896,
2007,47.400555,,,65.672959,54.249387,65.911099,66.186222,58.281438,58.281438,,...,,15.928004,45.78247,,,,91.710822,55.659926,58.512061,
2008,,18.00112,18.597792,72.684522,62.118862,71.96995,72.916878,57.099081,57.099081,,...,,,43.287705,40.889175,13.229954,,103.407032,63.81513,57.252683,
2009,,22.389435,23.425111,85.034118,73.827948,84.187276,85.253423,56.356526,56.356526,,...,,,42.929792,48.471026,23.990034,,121.319808,75.842048,49.273329,
2010,,20.993859,22.580092,91.046474,82.189046,90.037092,91.205898,52.116468,52.116468,,...,,,38.766,45.145315,28.866637,,129.072972,84.964411,44.222702,17.321511
2011,,20.094904,22.101109,97.078882,86.442315,95.685822,97.08392,52.273975,52.273975,69.637674,...,,,42.480722,40.076954,26.48082,,141.584861,89.546817,45.351451,18.049909
2012,,19.469888,21.777398,100.082318,90.343187,98.614801,99.96006,51.349863,51.349863,63.669153,...,,,42.519754,37.960229,32.445912,,144.97486,93.649262,43.40345,23.488919
2013,,17.987575,20.65863,100.30789,91.804397,99.75125,101.221339,51.093928,51.093928,70.58077,...,49.038084,,,32.121645,35.6314,1.893523,139.55503,95.534688,42.740851,24.223994
2014,,20.345787,23.083376,104.176166,91.887488,102.339488,104.24872,50.712231,50.712231,73.320227,...,47.476485,,,31.346417,63.665319,,148.494533,95.766699,44.357961,44.395868
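
The reshaping steps are easier to follow on a tiny table. The sketch below assumes the downloaded data arrive in long form, with one row per (country, year) pair and one column named after the indicator, as the outputs above suggest. The country names and numbers are made up.

```python
import pandas as pd

# Made-up numbers in a long layout: rows indexed by (country, year).
index = pd.MultiIndex.from_product([['Country A', 'Country B'], ['2005', '2006']],
                                   names=['country', 'year'])
long = pd.DataFrame({'GC.DOD.TOTL.GD.ZS': [10.0, 11.0, 20.0, 21.0]}, index=index)

wide = long.stack().unstack(0)          # index: (year, indicator); columns: country
wide.index = wide.index.droplevel(1)    # drop the indicator level, keep the year
print(wide)
print(wide.loc['2005'])                 # one year, all countries
```

**stack()** moves the column labels into the index and **unstack(0)** moves index level 0 (country) into the columns, which is how we end up with one row per year and one column per country.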


In [None]:
print(debt.loc["2005"].kurtosis())
print(debt.loc["2005"].skew())

In [None]:
debt.to_csv('debt.csv')
debt.to_stata('debt.dta')
debt.to_json('debt.json')
debt.to_pickle('debt.pkl')
debt.to_parquet('debt.parquet')
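
Each of these formats has a matching reader. As a minimal sketch, the round trip below saves a toy table (made-up numbers) to a temporary folder and reads it back from CSV and pickle. The temporary folder is an assumption of this demo, so that no files are left behind.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'United Kingdom': [90.5, 90.25], 'United States': [56.5, 55.5]},
                   index=pd.Index([2005, 2006], name='year'))   # made-up values

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, 'debt.csv')
    pkl_path = os.path.join(folder, 'debt.pkl')

    toy.to_csv(csv_path)
    toy.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path, index_col='year')   # plain text: we must say which column is the index
    from_pkl = pd.read_pickle(pkl_path)                   # pickle stores the DataFrame object itself

print(from_csv)
print(from_pkl.equals(toy))
```

The other formats follow the same pattern with **pd.read_stata**, **pd.read_json** and **pd.read_parquet** (the last one needs pyarrow or a similar engine installed).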

Where are these files?

In [None]:
os.getcwd()

In [None]:
pwd

In [None]:
cd 

In [None]:
cd C:\Users\andre\Dropbox\University\Teaching\UEA\Module organiser\