## Andrea Calef

> 💁‍♂️ My details: a.calef@uea.ac.uk; office hours Wednesdays 4-6pm on Teams. Or meet live by appointment! In any case, I recommend emailing me before any meeting, as I may have long queues of students or be stuck in job-related meetings.


The materials for this week are available as a Jupyter notebook. Jupyter notebooks mix rich text with runnable Python code, so you can follow along with this lecture, run the Python examples, and even add your own notes and code.

https://mybinder.org/v2/gh/tturocy/eco7026a/HEAD

Alternatively, you can copy and paste code from here into the Python command line or an IDE, such as Spyder. Although it is not the only way to get access to Jupyter notebooks, downloading **Anaconda** is recommended, as it gives access to many Python-based tools (e.g., Jupyter, PyCharm, Spyder, etc.). Please click <a href='https://www.anaconda.com/products/distribution#Downloads'>here</a> to download Anaconda's installer.

### Using Jupyter notebooks

To get your own copy of this notebook, choose **File** above then **Download**.

When you have done that, click in the field below, and either press the play button or type Shift+Enter. This executes the Python cell.

## Notes: necessary libraries to replicate this lecture.

pip list 
<br>
print(pd.__version__)

import numpy as np
<br>
import pandas as pd
<br>
import os

The first time you need to install pandas_datareader: pip install pandas_datareader

The first time you need to install pyarrow: pip install pyarrow 
<br>
import pyarrow as pa
<br>
import pyarrow.parquet as pq

According to PEP 8 and Pythonic convention, all imports should be placed at the beginning of your .py/.ipynb file.

In this lecture we will load a couple of .csv files and, at the end, save part of our work in files with different extensions. Before starting any work, it is worth checking the working directory ...

In [None]:
pwd 

Or after importing **os** ...

In [None]:
#import os
os.getcwd()

And modify the directory as needed with: 
<br>
cd C:\Your\Directory\Until\the\chosen\folder\
<br>
os.chdir('C:/Your/Directory/Until/the/chosen/folder/')
<br>
Let us try below!
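
As a minimal sketch of the same steps that does not depend on any folder on your own machine, the example below creates a temporary directory (only a stand-in for your chosen folder), moves into it with `os.chdir`, and then returns to the starting directory. The variable names are illustrative assumptions.

```python
import os
import tempfile

start_dir = os.getcwd()                 # remember where we started

# The temporary directory stands in for "your chosen folder"
with tempfile.TemporaryDirectory() as folder:
    os.chdir(folder)                    # change the working directory
    inside = os.getcwd()                # now points at the temporary folder
    print(inside)
    os.chdir(start_dir)                 # go back before the folder is deleted

print(os.getcwd() == start_dir)
```

Using forward slashes in the path, as in the `os.chdir` call above, avoids problems with backslashes being read as escape characters in Python strings.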

Today we are going to look at **pandas** and **numpy**, which are libraries for data analysis and numerical programming, respectively.
<br>
<br>
**numpy** and **pandas** provide some useful data types of their own, which considerably speed up the process of data analysis.
* numpy provides mathematical functions and multidimensional arrays.
* pandas builds on numpy and helps us handle data for analysis.

## numpy

The main functionality of **numpy** is to help process **arrays**. 
<br>
<br>
Arrays provide a way to store data. In the previous week, we looked at lists. Arrays take that concept and make it multi-dimensional. A two-dimensional array corresponds to a **matrix** in mathematics.
<br>
<br>
**numpy** is a library that provides many functions. To call a particular one, write the library name (or its alias), a dot (.), and the function name, as in np.array.
It is worth having a look at the **documentation** of this library by clicking on the following <a href='https://numpy.org/doc/stable/numpy-user.pdf'>hyperlink</a>.
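
To make the difference between a list and an array concrete, here is a small sketch with made-up numbers. It shows the shape and number of dimensions of a two-dimensional array, a function called with the dot notation, and how multiplying by 2 behaves differently for a list and for an array.

```python
import numpy as np

nested_list = [[1, 2, 3], [4, 5, 6]]   # a plain Python list of lists
arr = np.array(nested_list)            # the same data as a 2-D numpy array

print(arr.shape)      # (rows, columns)
print(arr.ndim)       # number of dimensions
print(arr.T)          # transpose, as for a matrix
print(np.sqrt(arr))   # a numpy function reached with the dot: np.<name>

doubled_list = nested_list * 2   # repeats the list, giving 4 rows
doubled_arr = arr * 2            # multiplies every element by 2
print(doubled_list)
print(doubled_arr)
```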

In [None]:
#import numpy as np 
a = np.array([[1,2],[3,4]])
print(a)
print(a*3)


In [None]:
m = np.array([[2,1],[1,2]])
eigen_value, eigen_vector = np.linalg.eig(m)
print(eigen_value)
print(eigen_vector)
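
One way to check the result, sketched below, is to verify the defining property of an eigenpair: multiplying the matrix by an eigenvector gives the same vector scaled by its eigenvalue. The variable names are illustrative, and `np.allclose` is used because floating-point results are rarely exact.

```python
import numpy as np

m = np.array([[2, 1], [1, 2]])
values, vectors = np.linalg.eig(m)

# Each column of `vectors` is the eigenvector paired with the eigenvalue
# in the same position of `values`.
for i in range(len(values)):
    v = vectors[:, i]
    lhs = m @ v              # matrix times eigenvector
    rhs = values[i] * v      # eigenvalue times eigenvector
    print(np.allclose(lhs, rhs))
```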

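Let us import the submodule to do some linear algebra.

In [None]:
# This import fails with ModuleNotFoundError: np is only the alias we chose for numpy, not the name of the package.
from np import linalg as la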

We now perform some matrix operations. For a refresher, please click <a href='https://twister.caps.ou.edu/OBAN2019/Intro_FEM_files/IFEM.AppC.pdf'>here</a>.

In [None]:
from numpy import linalg as la 
eigen_value_, eigen_vector_ = la.eig(m)
print(eigen_value_)
print(eigen_vector_)

In [None]:
print(la.det(a))
det = la.det(a)
print(np.round(det))
b = la.inv(a)
print(b)
print(np.matmul(b,a))
print(np.round(np.matmul(b,a)))
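
The rounding in the cell above is needed because floating-point arithmetic leaves tiny errors. As an alternative sketch, `np.isclose` and `np.allclose` compare values up to a small tolerance without changing them. The name `a_inv` is an illustrative choice.

```python
import numpy as np
from numpy import linalg as la

a = np.array([[1, 2], [3, 4]])
det = la.det(a)          # the exact value is 1*4 - 2*3 = -2
a_inv = la.inv(a)        # exists because the determinant is non-zero
product = a_inv @ a      # should be the 2x2 identity, up to rounding error

print(det)
print(np.isclose(det, -2.0))
print(np.allclose(product, np.eye(2)))
```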

## pandas

**pandas** is a library that builds on the mathematical tools introduced by numpy to provide a comprehensive set of statistics tools, in a similar way to software like R or STATA.
<br>
<br>
We can use a public dataset as a starting point for producing some real statistics.
<br>
<br>
For example, let us go to the website of Federal Reserve Economic Data (<a href='https://fred.stlouisfed.org/'>**FRED**</a>), which provides some useful macroeconomic variables we can use for analysis.
<br>
<br>
We will have a look at the UK unemployment rate. 
<br>
<br>
The FRED website can supply us with a CSV file of the relevant variable, which we can work with if we put it in the folder in which Python is running. (We can also use Python’s **requests** module to get the file.)

### pandas: Data structures

Before loading our data, we need to learn about the data types available from the pandas library. 
<br>
<br>
The main pandas data types (dtypes) are bool, int64, float64, datetime64 and object.
<br>
<br>
The main pandas data structures are Series and DataFrame.
<br>
<br>
A **Series** is like a column of data, whereas a DataFrame is a collection of these Series. 
<br>
<br>
The distinction is similar to that between a column of a spreadsheet, and the whole spreadsheet itself. 
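
The spreadsheet analogy can be sketched as follows. The numbers are made up for illustration only: two Series are placed side by side to form a DataFrame, and selecting one column of the DataFrame gives back a Series.

```python
import pandas as pd

years = [2008, 2009, 2010]
# Made-up values, for illustration only
inflation = pd.Series([3.0, 2.0, 3.0], index=years, name='inflation')
growth = pd.Series([-0.2, -4.5, 2.2], index=years, name='growth')

# Put the two columns side by side; column names come from the Series names
table = pd.concat([inflation, growth], axis=1)
print(table)

one_column = table['inflation']      # a single column is a Series again
print(type(table).__name__)
print(type(one_column).__name__)
```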

In [None]:
#import pandas as pd
s = pd.Series([3,2,3,4], name='inflation')
print(s)
s = pd.Series([3,2,3,4], name='inflation', index = [2008, 2009, 2010, 2011])
print(s[2008])
print(s.loc[2008:2010])

The Series object is much like a numpy array, but it also lets us label our data with an index <u>and</u> offers some descriptive statistics features.
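
A small sketch of what the index adds, using made-up numbers: when two Series are added, pandas matches values by index label, whereas numpy arrays are added by position.

```python
import pandas as pd

s1 = pd.Series([3, 2, 3], index=[2008, 2009, 2010])
s2 = pd.Series([1, 1, 1], index=[2009, 2010, 2011])

total = s1 + s2          # matched by year label; unmatched years become NaN
print(total)

arr_total = s1.to_numpy() + s2.to_numpy()   # numpy adds by position
print(arr_total)
```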

In [None]:
print(s.describe())
print(s.count())
print(s.mean())
print(s.median())
print(s.std())
print(s.min())
print(s.quantile(0.25))
print(s.max())

print(s.mode())
print(s.skew())
print(s.kurt())

If the file is stored online (for example on GitHub), read it directly from its URL, paying attention to its extension.
<br>
If the file is stored on your computer, first check (and, if necessary, change) the working directory:
<br>
import os
<br>
os.getcwd()
<br>
#os.chdir('C:/Users/andre/Dropbox/University/Teaching/UEA/Module organiser/ECO-7026A Programming and Data Analytics for Behavioural Economists/Lecture 02')


In [None]:
url = 'https://raw.githubusercontent.com/tturocy/eco7026a/main/week2/LRHUTTTTGBM156S.csv'
df = pd.read_csv(url)

In [None]:
print(df)

In [None]:
usa_data = pd.read_csv('https://raw.githubusercontent.com/tturocy/eco7026a/main/week2/LRHUTTTTUSM156S.csv')

Let us have a look at the first four observations ...

In [None]:
usa_data.head(4)

... and the last four.

In [None]:
usa_data.tail(4)

In [None]:
world_data = df.merge(usa_data,sort=True) # The merge considers just the dates in common. 
world_data.head(8) 

In [None]:
world_data1 = df.merge(usa_data,how='outer') # The merge considers the longest time period and generates NaN for the variables with shorter time periods. 
world_data1.head(8) 

We notice that the data are not well sorted. Why? What can we do to correct this?

In [None]:
world_data1 = world_data1.sort_values(by=['DATE']) # This command sorts the world_data1 DataFrame by the DATE column. 
world_data1.head(8) 

In [None]:
world_data1.index = range(len(world_data1.index))
world_data1.head(8)

Alternative solution:

world_data1.reset_index(drop=True, inplace = True) # This command resets the index. The option inplace=True modifies world_data1 directly. 
world_data1.head(8) 

There are two other solutions; one is written below. Look again above to find the other one ...

In [None]:
world_data1 = df.merge(usa_data,how='outer', sort = True) # sort=True sorts the result by the join key (DATE). This works here because dates written as YYYY-MM-DD sort chronologically even as text.
world_data1.head(8) 
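
The difference between the merges above can be seen more easily on a toy example. The two small DataFrames below use made-up numbers and share only one date. This is only a sketch of the `how` and `sort` options, not the FRED data.

```python
import pandas as pd

# Made-up numbers, for illustration only
uk = pd.DataFrame({'DATE': ['2000-02-01', '2000-03-01'], 'uk': [5.8, 5.7]})
usa = pd.DataFrame({'DATE': ['2000-01-01', '2000-02-01'], 'usa': [4.0, 4.1]})

inner = uk.merge(usa)                           # default how='inner': common dates only
outer = uk.merge(usa, how='outer', sort=True)   # all dates, sorted by DATE

print(inner)
print(outer)
```

The inner merge keeps only the date present in both DataFrames, while the outer merge keeps every date and fills the gaps with NaN.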

Let us rename the two main variables of our data set. 

In [None]:
world_data.rename(columns={'LRHUTTTTGBM156S': 'uk_unemployment', 'LRHUTTTTUSM156S': 'usa_nemployment'})

Do you notice anything above? 
<br>
Let us see all the data below.

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None): print(world_data)

Nice, but ...

In [None]:
world_data.rename(columns={'LRHUTTTTGBM156S': 'uk_unemployment', 'LRHUTTTTUSM156S': 'usa_unemployment'}, inplace = True)

In [None]:
print(world_data['usa_unemployment'])


Without the option "inplace = True", rename returns a modified copy and leaves the original DataFrame unchanged, so the new column name is lost unless the result is assigned to a variable.
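
A minimal sketch of this behaviour on a toy DataFrame (the names `df_demo` and `renamed` are illustrative): the original keeps its column name until the result of `rename` is assigned back.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2]})

renamed = df_demo.rename(columns={'A': 'a'})   # returns a new DataFrame
original_columns = list(df_demo.columns)       # the original is unchanged
print(original_columns)
print(list(renamed.columns))

df_demo = df_demo.rename(columns={'A': 'a'})   # reassigning keeps the change
print(list(df_demo.columns))
```

Reassigning the result is an alternative to `inplace = True` with the same final effect.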

**DataFrame.assign()** adds one or more columns. It can contain functions and/or operations inside.

In [None]:
world_data = world_data.assign(Source = 'FRED', Diff_U = world_data.uk_unemployment - world_data.usa_unemployment) 
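
As a sketch of the "functions inside" case, the toy example below (made-up numbers, illustrative column names) passes a constant and a lambda to `assign`. The lambda receives the DataFrame itself, and `assign` returns a new DataFrame rather than changing the original.

```python
import pandas as pd

rates = pd.DataFrame({'uk': [5.0, 6.0], 'usa': [4.0, 7.0]})  # made-up numbers

rates2 = rates.assign(
    source='toy',                          # a constant is repeated on every row
    diff=lambda d: d['uk'] - d['usa'],     # a function of the DataFrame
)
print(rates2)
print(list(rates.columns))                 # the original has no new columns
```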

Let us better understand the data set we are using before undertaking any analysis.

In [None]:
world_data.dtypes 

In [None]:
world_data.shape 

In [None]:
world_data.info()

In [None]:
world_data1.info()

In [None]:
world_data1.isna()

Does this remind you of anything from the last lecture?

In [None]:
world_data1.isna().mean()*100
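
Why the line above gives a percentage can be seen on a toy DataFrame with made-up values: `isna()` returns True or False for each cell, and taking the mean treats True as 1 and False as 0, so the mean is the share of missing values in each column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan],
                    'y': [1.0, 2.0, 3.0, np.nan]})

missing = toy.isna()            # True where a value is missing
share = missing.mean() * 100    # percentage of missing values per column
print(share)
```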

In [None]:
world_data.nunique()

**DataFrame.dtypes** reports a subset of the information shown by **DataFrame.info()**.

In [None]:
import datetime
world_data['date'] = pd.to_datetime(world_data['DATE'], yearfirst = True, format='%Y-%m-%d') # it creates a datetime64[ns] variable. 
world_data['day'] = world_data['date'].dt.day
world_data['month'] = world_data['date'].dt.month
world_data['year'] = world_data['date'].dt.year
world_data['weekday'] = world_data['date'].dt.dayofweek
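
The same steps on two hand-written dates, as a sketch: note that `dt.dayofweek` counts from Monday = 0 to Sunday = 6. The `weekday_name` column uses `dt.day_name()` and is an addition for readability, not part of the lecture's data set.

```python
import pandas as pd

dates = pd.DataFrame({'DATE': ['1990-01-01', '1990-02-01']})
dates['date'] = pd.to_datetime(dates['DATE'], format='%Y-%m-%d')

dates['month'] = dates['date'].dt.month
dates['weekday'] = dates['date'].dt.dayofweek        # Monday is 0, Sunday is 6
dates['weekday_name'] = dates['date'].dt.day_name()  # the same day as text
print(dates)
```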


In [None]:
world_data.dtypes

In [None]:
world_data[['DATE','weekday']] # note the double square brackets when you select more than one column. 


In [None]:
world_data = world_data.set_index(['DATE'])
world_data

In [None]:
nineties_data = world_data.loc['1990-01-01':'1999-12-01']
nineties_data

In [None]:
nineties_data_reduced = world_data.loc['1990-01-01':'1999-12-01', world_data.columns != 'day']
nineties_data_reduced

In [None]:
nineties_data_reduced = world_data.loc['1990-01-01':'1999-12-01',['date','uk_unemployment','usa_unemployment','weekday']]
nineties_data_reduced
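
A point worth noticing in the selections above is that `.loc` slices by label and includes both end labels, unlike position-based slicing. The sketch below uses a toy DataFrame with made-up numbers and a sorted index of date strings.

```python
import pandas as pd

toy = pd.DataFrame(
    {'uk': [7.0, 7.1, 7.2, 7.3], 'usa': [5.0, 5.1, 5.2, 5.3]},  # made-up numbers
    index=['1989-12-01', '1990-01-01', '1990-02-01', '1990-03-01'],
)

rows = toy.loc['1990-01-01':'1990-02-01']                # both end labels are included
rows_cols = toy.loc['1990-01-01':'1990-02-01', ['uk']]   # rows first, then columns
by_position = toy.iloc[1:2]                              # position slices exclude the end

print(rows)
print(rows_cols)
print(by_position)
```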

In [None]:
world_data.rename(columns={'LRHUTTTTGBM156S': 'uk_unemployment', 'LRHUTTTTUSM156S': 'usa_unemployment'}, inplace = True)
nineties_data_reduced = world_data.loc['1990-01-01':'1999-12-01',['date','uk_unemployment','usa_unemployment','weekday']]
nineties_data_reduced

In [None]:
nineties_data['uk_unemployment'].mean()
nineties_data['usa_unemployment'].mean()
nineties_data[['uk_unemployment','usa_unemployment']].mean()
nineties_data.describe(exclude=['int64','datetime64[ns]'])


In [None]:
nineties_data = world_data.loc['1990-01-01':'1999-12-01']
nineties_data['uk_unemployment'].mean()


In [None]:
nineties_data['usa_unemployment'].mean()


In [None]:
nineties_data[['uk_unemployment','usa_unemployment']].mean()


In [None]:
nineties_data.describe(exclude=['int64','datetime64[ns]'])


In [None]:
from pandas_datareader import wb
matches = wb.search('government.*debt.*gdp')
matches


In [None]:
debt = wb.download(indicator='GC.DOD.TOTL.GD.ZS', country="all", start=2005, end=2016)
debt

In [None]:
debt = wb.download(indicator='GC.DOD.TOTL.GD.ZS', country="all", start=2005, end=2016).stack().unstack(0)
debt

In [None]:
debt1 = wb.download(indicator='GC.DOD.TOTL.GD.ZS', country="all", start=2005, end=2016).stack(dropna=False).unstack(0)
debt1

In [None]:
debt2 = wb.download(indicator='GC.DOD.TOTL.GD.ZS', country="all", start=2005, end=2016).unstack(0)
debt2

In [None]:
del debt1, debt2

In [None]:
debt

In [None]:
debt.index = debt.index.droplevel(1) # drop indicator index 
debt
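
The reshaping used for `debt` can be followed more easily on a toy example. The sketch below builds a small Series with a (country, year) MultiIndex and made-up values, then uses `unstack(0)` to move the countries from the rows to the columns. The country codes and numbers are hypothetical.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [['A', 'B'], ['2005', '2006']], names=['country', 'year']
)
long = pd.Series([10.0, 11.0, 20.0, 21.0], index=index, name='debt')  # made-up

wide = long.unstack(0)   # move level 0 (country) from the rows to the columns
print(wide)
print(wide.loc['2005'])  # one year, all countries
```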

In [None]:
print(debt.loc["2005"].kurtosis())
print(debt.loc["2005"].skew())

In [None]:
debt.to_csv('debt.csv')
debt.to_stata('debt.dta')
debt.to_json('debt.json')
debt.to_pickle('debt.pkl')
debt.to_parquet('debt.parquet')
debt.to_latex('debt.tex')
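
The formats differ in what they preserve. As a sketch, assuming a toy DataFrame with made-up values and a temporary folder, the example below writes a CSV and a pickle file and reads them back: the pickle restores the DataFrame exactly, while the CSV stores only text, so the year labels come back as integers.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({'A': [1.5, 2.5]}, index=['2005', '2006'])  # made-up values

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, 'toy.csv')
    pkl_path = os.path.join(folder, 'toy.pkl')

    toy.to_csv(csv_path)
    toy.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path, index_col=0)   # index read back from text
    from_pkl = pd.read_pickle(pkl_path)             # exact copy of the object

print(from_pkl.equals(toy))
print(from_csv.index.tolist())
```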

Please click on the following hyperlinks for additional information about some data formats:

* <a href='https://fileinfo.com/extension/json'>.json</a>
<br>
<br>
* <a href='https://www.databricks.com/glossary/what-is-parquet'>.parquet</a>
<br>
<br>
* <a href='https://pythonnumericalmethods.berkeley.edu/notebooks/chapter11.03-Pickle-Files.html'>.pickle</a>

Where are these files?

In [None]:
os.getcwd()

In [None]:
pwd

In [None]:
cd 

In [None]:
cd C:\Users\andre\Dropbox\University\Teaching\UEA\Module organiser\