# Module 3 - Structured Data - Tutorial

In this tutorial, you will be applying the data analytics cycle using the Global Financial Crisis of 2008 as a case study.

If you would like to know a little bit more about it, please check this video.

[![Video about the Global Financial Crisis of 2008](https://img.youtube.com/vi/0aE7___9ZtQ/0.jpg)](https://www.youtube.com/watch?v=0aE7___9ZtQ)


### Motivation: The Global Financial Crisis of 2008

You have been hired to be the data analyst of the company *Hesiod*, which analyses real estate indicators of different countries. 

Marcel, the CEO of the company, provides you with some datasets of financial and economic indicators and asks for your opinion about what to expect in the housing market in the next couple of years. Can you help Marcel?

### The Analytics Process

<img src="graphics/QDAVI_cycle_sm.png" />




### 1. Collecting / Loading Data

The dataset that you need to analyse is stored in the file 'data/financial_data_v2.csv'.


In [11]:
# import the required Python libraries to process the data:
import pandas as pd                # used for data manipulation and data analysis
import matplotlib.pyplot as plt    # used for visualisation
import numpy as np                 # used for numerical operations

In [2]:

# specify the location and the filename of your dataset
file_path =   

# load the .csv dataset
data = 

# take a look at the dataset
data

General size of the dataset: (62, 17)
The dataset has 62 rows and 17 columns!


In [None]:
# get the dimensions of your dataset
dimensions = 

print( 'General size of the dataset: ' + str( dimensions ))

# extract the number of rows and columns from your data
num_rows = 
num_col  = 
print('The dataset has ' + str( num_rows ) + ' rows and ' + str( num_col ) + ' columns!' )
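
As a hedged illustration of the two steps above (loading a CSV and reading its dimensions), the sketch below uses a tiny in-memory table. The column names `Country_A` and `Country_B` and all values are invented for the example and are not taken from the tutorial dataset.

```python
import io
import pandas as pd

# Invented stand-in for a CSV file; these values are NOT from the tutorial dataset
csv_text = """Date,Quarter,Country_A,Country_B
2003,Q1,100.0,98.5
2003,Q2,101.2,
2003,Q3,102.9,99.7
"""

# pd.read_csv accepts a file path or, as here, a file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# .shape is a (rows, columns) tuple, so it can be unpacked directly
toy_rows, toy_cols = toy.shape
print('The toy table has ' + str(toy_rows) + ' rows and ' + str(toy_cols) + ' columns')
```

The empty field in the second data row is read as NaN, which is how missing entries in a CSV file usually show up in a DataFrame.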


**QUESTION:** 
- What do the **rows** and **columns** of this dataset **represent**?



**ANSWER HERE**



### 2. Cleaning / Processing Data

An important step in the data analytics cycle is to ensure the **quality** of our data.

**QUESTIONS**

Answer the following questions (this can be done in groups):

- Q1: What is this dataset about?
- Q2: What are the main problems with this dataset? 
- Q3: Is the data relevant to our initial question (our Business concern)?

#### EXPLORE THE DATASET

This dataset describes the **house price indexes of different countries in the world**.

Remember our question: *what to expect in the next couple of years in the housing market?*

This is a vague question that requires you to **explore** the dataset to get some *insights* about the data.

In [2]:
# WHAT COUNTRIES ARE IN THIS DATASET AND HOW MANY?
# WRITE THE CODE REQUIRED TO LIST ALL THE COUNTRIES THAT ARE IN THE DATASET

# YOUR CODE HERE


In [3]:
# HOW MANY YEARS OF DATA DO WE HAVE IN OUR DATASET?
# PRINT A STRING SAYING: "This dataset describes house prices from XXXX to XXXX"
# WHERE XXXX CORRESPONDS TO THE YEARS THAT SPAN THIS DATASET

# YOUR CODE HERE



**QUESTION:** For how many years has this information been collected?



**ANSWER**


### Dealing with Missing Data

In Pandas, missing data is represented by two values:

- *None*
- *NaN (Not a Number)*

Pandas treats *None* and *NaN* as essentially interchangeable for indicating **missing or null values**. To facilitate this convention, there are several useful functions for **detecting** null values in a Pandas DataFrame:

- isnull()
- notnull()
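
A minimal sketch of how these detection functions can be used to count missing entries and turn the count into a percentage. The DataFrame, its column names and its values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Invented toy data; None and np.nan both end up as NaN in a float column
toy = pd.DataFrame({'Country_A': [100.0, None, 102.9, np.nan],
                    'Country_B': [98.5, 99.1, np.nan, 100.4]})

# isnull() returns a DataFrame of True/False flags (True where a value is missing)
missing_mask = toy.isnull()

# Summing the flags counts the missing entries in each column
missing_count = missing_mask.sum()

# Averaging the flags gives the fraction missing; multiply by 100 for a percentage
missing_pct = missing_mask.mean() * 100
print(missing_pct)

# notnull() is simply the opposite mask
present_count = toy.notnull().sum()
```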

To handle null values in a dataset, we can use the functions *fillna()*, *interpolate()* or *dropna()*.

*fillna()* and *interpolate()* replace NaN values with new values, whereas *dropna()* removes the rows or columns that contain them.

- **fillna()** function hard-codes a value to replace the NaN entries. It is used for simpler datasets with few missing entries (like in this lecture).

- **interpolate()** function also fills NaN values in the dataframe, but instead of hard-coding a value it uses an interpolation technique to **estimate** each missing value from the known values around it. This is an advanced method that you might be interested in exploring in your tutorials or in your assignments if you are dealing with a complex dataset.

- **dropna()** function drops the rows or the columns that contain missing values. This is usually applied to datasets where some variables have **too many missing entries** and it is not possible to estimate them or to hand-code them.
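
The three functions can be compared side by side on a small example. This is only a sketch on an invented Series (the numbers are not from the tutorial dataset), meant to show how differently each function treats the same gaps.

```python
import numpy as np
import pandas as pd

# Invented toy series with two missing entries
prices = pd.Series([100.0, np.nan, 104.0, np.nan, 112.0])

# fillna(): every gap receives the same hard-coded value (here the mean of the known values)
filled_mean = prices.fillna(prices.mean())

# interpolate(): each gap is estimated from its neighbours (straight line between known points)
filled_linear = prices.interpolate(method='linear')

# dropna(): the missing entries are removed altogether
dropped = prices.dropna()

print(filled_mean.tolist())
print(filled_linear.tolist())
print(dropped.tolist())
```

Note that the mean fill places the same value in both gaps, while the linear interpolation follows the upward trend of the neighbouring points.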

**QUESTION**

What is the percentage of missing house prices in "Australia"?
What about in "United Kingdom"?

In [12]:
# CODE TO COMPUTE YOUR ANSWER HERE (hint: pandas info() function can be useful)



**ANSWER**



In [None]:
# CREATE A NEW COLUMN IN YOUR DATAFRAME CALLED "Australia_Mean" WITH THE SAME INFORMATION AS IN COLUMN "Australia"
# FILL THE MISSING VALUES OF "Australia_Mean" BY COMPUTING THE AVERAGE OF THE HOUSE PRICES IN THAT COLUMN


# PLOT THE HOUSE PRICE INDICES IN THE COLUMNS "Australia_Mean" AND "Australia_Full" AND COMPARE THEM


In [None]:
# CREATE A NEW COLUMN IN YOUR DATAFRAME CALLED "Australia_Interpolate" WITH THE SAME INFORMATION AS IN COLUMN "Australia"



# FILL THE MISSING VALUES OF Australia_Interpolate BY USING THE INTERPOLATE FUNCTION (HINT: interpolate(method ='linear', inplace=True))




# PLOT THE HOUSE PRICE INDICES IN THE COLUMNS "Australia_Interpolate" AND "Australia_Full" AND COMPARE THEM


**Question**

Discuss how the different methods for dealing with missing data can impact your analysis.

**For curious students (optional)**

Interpolation offers several different methods to estimate the missing values of your data.

The column "United Kingdom" has 48% of the data missing. Explore the different price estimations in that country when you use different interpolation methods. Which one best approximates the real values ("United Kingdom_Full")?

method : {‘linear’, ‘time’, ‘index’, ‘values’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘krogh’, ‘polynomial’, ‘spline’, ‘piecewise_polynomial’, ‘from_derivatives’, ‘pchip’, ‘akima’}
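
One possible way to compare interpolation methods against known values is to measure the mean absolute error on the entries that were missing. The sketch below assumes an invented series in which the true values are known; it tries only `'linear'` and `'index'`, since several of the other methods in the list require SciPy to be installed.

```python
import numpy as np
import pandas as pd

# Invented "true" values on an unevenly spaced numeric index
truth = pd.Series([0.0, 2.0, 4.0, 8.0, 16.0], index=[0, 1, 2, 4, 8])

# Hide two of the values to simulate missing data
observed = truth.copy()
observed.iloc[[2, 3]] = np.nan
gaps = observed.isnull()

# Mean absolute error of each method, measured only on the hidden entries
errors = {}
for method in ['linear', 'index']:
    estimate = observed.interpolate(method=method)
    errors[method] = (estimate[gaps] - truth[gaps]).abs().mean()

print(errors)
```

In this toy case `'index'` recovers the hidden values because it takes the uneven spacing of the index into account, whereas `'linear'` treats all entries as equally spaced. On real data the ranking of methods may well differ.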

### Visualization

We can try to visualize our data to get some more insights. We can plot the house price indexes of Australia from 2003 to 2018. But for that... we need to process data again...

We need to eliminate the *Quarter* dimension of our data by grouping the data by Date (which represents the year). This is done by computing the **average** of the **house price indexes** of each **country** over **ALL QUARTERS** and **aggregating** the results **by year**. In Python, one can do this using the function *.groupby()*, and the aggregation of the data is done by computing the **mean**.
The following figure shows an example.

<img src='graphics/groupBy.png' />
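
The grouping step described above can be sketched on a small invented table. The column names `Date` and `Quarter` follow the description in the text; `Country_A` and all the values are made up for the example.

```python
import pandas as pd

# Invented quarterly data for two years
toy = pd.DataFrame({
    'Date':      [2003, 2003, 2003, 2003, 2004, 2004, 2004, 2004],
    'Quarter':   ['Q1', 'Q2', 'Q3', 'Q4', 'Q1', 'Q2', 'Q3', 'Q4'],
    'Country_A': [100.0, 102.0, 104.0, 106.0, 110.0, 112.0, 114.0, 116.0],
})

# Group the rows by year, then average the four quarterly values of each year
yearly = toy.groupby('Date')['Country_A'].mean()
print(yearly)

# Overall average across the yearly values
overall = yearly.mean()
print('Overall average: ' + str(round(overall, 4)))
```

The result is a Series indexed by year, with one averaged value per year instead of four quarterly ones.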


In [None]:

# Group data records from "Australia_Interpolate" and "United Kingdom_Full" by Date so that there is one averaged value per year:
AU_house_price_indx = 
UK_house_price_indx = 

# Compute the overall average of the house prices in Australia 
AU_house_price_indx_avg = 
print( 'The average house price index in Australia is ' + str(round(AU_house_price_indx_avg,4)) )

# Compute the overall average of the house prices in UK
UK_house_price_indx_avg = 
print( 'The average house price index in United Kingdom is ' + str(round(UK_house_price_indx_avg,4)) )



In [8]:
# PLOT THE HOUSE PRICES IN  AUSTRALIA AND IN UK IN A SINGLE GRAPH
# DON'T FORGET TO PUT LABELS ON ALL AXES
# ADD A TITLE TO THE VISUALISATION
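
A minimal sketch of drawing two yearly series on one graph with axis labels, a title and a legend. The two series and their names are invented for illustration and do not come from the tutorial dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented yearly index values for two hypothetical countries
years = [2003, 2004, 2005, 2006]
series_a = pd.Series([100.0, 104.0, 109.0, 115.0], index=years)
series_b = pd.Series([100.0, 108.0, 113.0, 111.0], index=years)

# One figure, one set of axes, two lines
fig, ax = plt.subplots()
ax.plot(series_a.index, series_a.values, label='Country A')
ax.plot(series_b.index, series_b.values, label='Country B')

# Label both axes, add a title and show which line is which
ax.set_xlabel('Year')
ax.set_ylabel('House price index')
ax.set_title('Toy house price indexes by year')
ax.legend()
```

In a notebook the figure is normally displayed automatically at the end of the cell; in a plain script, `plt.show()` would be needed.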



**QUESTION** 

Are there similarities between the house price indexes in the UK and in Australia?



**ANSWER**

In [14]:
# SPLIT YOUR ANALYSIS BEFORE THE CRISIS AND AFTER THE CRISIS

#Select UK house price indexes from 2003 - 2008
UK_house_price_03_08 = 

#Select UK house price indexes from 2008 - 2018
UK_house_price_08_18 =  

#Select AU house price indexes from 2003 - 2008
AU_house_price_03_08 =   

#Select AU house price indexes from 2008 - 2018
AU_house_price_08_18 =


# VISUALISE THE DIFFERENT GRAPHS: COMPARE AUSTRALIA AND UK BETWEEN 2003 - 2008

# YOUR CODE HERE


# VISUALISE THE DIFFERENT GRAPHS: COMPARE AUSTRALIA AND UK BETWEEN 2008 - 2018

# YOUR CODE HERE
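
As a hedged illustration of selecting a range of years, the sketch below assumes a Series whose index holds the years as integers (as a group-by on the year would produce). The values are invented.

```python
import pandas as pd

# Invented yearly values indexed by year
yearly = pd.Series([100.0, 104.0, 109.0, 115.0, 111.0, 113.0],
                   index=[2005, 2006, 2007, 2008, 2009, 2010])

# .loc slices by label and includes BOTH end points
before = yearly.loc[2005:2008]
after = yearly.loc[2008:2010]

print(before)
print(after)
```

Because label-based slices are inclusive, the year 2008 appears in both selections, which matches the "2003 - 2008" and "2008 - 2018" ranges used in this exercise.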



**QUESTION**
- Q1: Can you explain the exponential growth of house prices in the UK between 2003 - 2007?
- Q2: Why did house prices start to drop in the UK between 2007 - 2014?
- Q3: What happened to the housing sector in Australia:
    - Q3.1. Before 2007?
    - Q3.2. Between 2007 - 2010?
    - Q3.3. Between 2010 - 2013?
    - Q3.4. After 2014?


In [3]:
# SOMETHING TO SUPPORT OUR CONCLUSIONS :-)
from IPython.display import IFrame

IFrame(src='https://www.economist.com/finance-and-economics/2019/03/09/prices-of-prime-properties-around-the-world-are-falling', width=700, height=600)
