# Elements Of Data Processing - Week 1

## Getting Started with Jupyter Notebook
Jupyter Notebook is an extremely useful tool for developing and presenting projects (particularly in Python).  You can include code segments and view their output directly in your browser.  You can also add rich text, visualisations, equations and more.

## Cells
Jupyter notebook contains two main types of cells:
- Markdown cells: These can be used to contain text, equations and other non-code items.  The cell that you're reading right now is a markdown cell.  You can use [Markdown](https://www.markdownguide.org/) to format your text.  If you prefer, you can also format your text using <b>HTML</b>.  Clicking the **Run** button will format and display your text.
- Code cells: These contain code segments that can be executed individually.  When executed, the output of the code will be displayed below the code cell.  Click the **Run** button to execute a code segment.  You can also run a code segment by pressing `Ctrl + Enter`.

## Running Code
Try running the code segments below and verify that the output is correct.

In [None]:
message="hello world"
print(message)

In [1]:
for i in range(5):
    print(str(i) + " squared is " + str(i*i))

0 squared is 0
1 squared is 1
2 squared is 4
3 squared is 9
4 squared is 16


Variables are retained between code segments.  You can, for example, refer to the `message` variable created in the code segment above:

In [3]:
message = "welcome"
print("The COMP20008 team wishes to say: " + message)

The COMP20008 team wishes to say: welcome


Try adding your own code cell below and use it to print a different message.  

## Errors
If your code contains any errors, the error message will be displayed underneath the code segment once it's run.  This helps you identify the problem and debug the code.  Try fixing the code below:

In [6]:
print("Welcome to COMP20008")
print("We're glad you've chosen this subject")
students=30
if students>25
    print('This is a big class!')

SyntaxError: invalid syntax (<ipython-input-6-c53d639eb989>, line 4)

### Exercise 1
Create a new code cell below this one.  Write a Python program that will print the first $n$ numbers of the Fibonacci sequence in **reverse** order.  Verify that it works for $n=10$.

## Pandas
Libraries contain useful resources, such as classes and subroutines, that you can use in your programs.

Pandas is a library that provides high-level data structures and manipulation tools for fast, convenient data analysis.  As with most libraries, an [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html) is available which details all of the functionality provided by pandas.  This lab will focus on the two most important data structures provided by pandas, the [Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) and [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).

It's worth reading through the [Intro to Data Structures](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html) article on the pandas website to familiarise yourself with these two data structures.  There are also a number of step-by-step tutorials available online, such as [this one by DataCamp](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python), that are worth following.

In [None]:
import pandas as pd

### Series
A Series is a one-dimensional array-like object containing an array of data and an associated array of data labels, called its index.

<img src="images/series1.jpg">

The basic method to create a Series:
    
    - s = pd.Series(data, index=index)

Here, data can be different things, including:
    
    - a list
    - an array
    - a dictionary
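
As a minimal sketch of the constructor with each kind of input listed above (the variable names and values are made up for illustration, and NumPy is assumed to be installed alongside pandas):

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']

# from a list, with explicit labels
from_list = pd.Series([10, 20, 30], index=labels)

# from a NumPy array, with the same labels
from_array = pd.Series(np.array([1.5, 2.5, 3.5]), index=labels)

# from a dictionary: the keys become the index
from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

print(from_list)
print(from_array)
print(from_dict)
```

If `index` is left out for a list or an array, pandas falls back to the default integer index starting at zero.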

#### Example 1 : Create a Basic Series Object

In [None]:
# series constructor with data as a list of integers

l = [4,3,-5,9,1,7]
s = pd.Series(l)

In [None]:
# the default indexing starts from zero
s.index

In [None]:
# retrieve the values of the series
s.values

In [None]:
# create your own index using lists
newIndex = ['a','b','c','d','e','f']
s.index  = newIndex

In [None]:
# verify the index
s
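
Once the index holds custom labels, a value can be looked up by its label as well as by its position.  A small sketch, rebuilding the same series so that it runs on its own:

```python
import pandas as pd

s = pd.Series([4, 3, -5, 9, 1, 7], index=['a', 'b', 'c', 'd', 'e', 'f'])

# look up a single value by label
print(s['c'])

# look up by position (0-based), independent of the labels
print(s.iloc[2])

# select several labels at once; the result is another Series
print(s[['a', 'f']])
```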

In [None]:
# Creating a series from a python dict

Aus_Emission = {'1990':15.45288167, '2000':17.20060983, '2007':17.86526004,
                '2008':18.16087566,'2009':18.20018196,'2010':16.92095367,
                '2011':16.86260095, '2012':16.51938578, '2013':16.34730205}

co2_Emission = pd.Series(Aus_Emission)

In [None]:
# retrieve the values of the series
co2_Emission.values

In [None]:
# verify the series object
co2_Emission
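
When a Series is built from a dictionary, the keys become the index.  If an explicit `index` is also passed, pandas keeps only those labels, in that order, and fills any label missing from the dictionary with `NaN`.  A sketch with made-up values (the `rainfall` data is hypothetical):

```python
import pandas as pd

rainfall = {'Mon': 1.2, 'Tue': 0.0, 'Wed': 3.4}   # hypothetical data

# keys become the index, in insertion order
by_day = pd.Series(rainfall)
print(by_day)

# an explicit index selects and reorders; 'Thu' has no value, so it is NaN
subset = pd.Series(rainfall, index=['Wed', 'Mon', 'Thu'])
print(subset)
```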

### Slicing
Slicing allows you to take part of a Series or DataFrame, in order to visualise it separately or perform more detailed analysis.  You can **select** sections of list-like types (arrays, tuples, NumPy arrays) by using various slice notations:

In [None]:
# slicing the series using a boolean array operation 
co2_Emission[co2_Emission>16.0]

In [None]:
# slicing the series by index label (all years up to and including '2000')
co2_Emission[:'2000']
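
The two cells above select by condition and by label.  The same kinds of selection can be written more explicitly with the `.loc` (label-based) and `.iloc` (position-based) indexers.  A sketch on a small made-up series, not the emissions data:

```python
import pandas as pd

# small made-up series with year labels stored as strings
s = pd.Series([15.5, 17.2, 17.9, 16.9], index=['1990', '2000', '2007', '2010'])

# boolean mask: keep the entries where the condition is True
above = s[s > 17.0]
print(above)

# label-based slice: the end label IS included
by_label = s.loc[:'2000']
print(by_label)

# position-based slice: the end position is NOT included
by_position = s.iloc[:2]
print(by_position)
```

Note that label-based slices include the end label, whereas position-based slices follow the usual Python rule and exclude the end position.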

In [None]:
# double the values of the series object
doubled = co2_Emission*2
doubled

In [None]:
# finding the average value of the series
co2_Emission.mean()
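
Arithmetic on a Series is applied element by element and returns a new Series, so the original is left unchanged.  Methods such as `mean()` reduce the whole Series to a single number.  A sketch with made-up values:

```python
import pandas as pd

prices = pd.Series([2.0, 4.0, 6.0], index=['x', 'y', 'z'])  # hypothetical data

# element-wise arithmetic returns a new Series with the same index
doubled_prices = prices * 2
print(doubled_prices)

# the original Series is unchanged
print(prices)

# an aggregation collapses the Series to one value
print(prices.mean())
```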

In [None]:
# naming the series (this becomes the column name if it is placed in a DataFrame)
co2_Emission.name = 'CO2 Emission'

In [None]:
# defining the name of the index
co2_Emission.index.name = 'Year'

In [None]:
# verify the series object
co2_Emission
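
One place these names show up: when a Series is converted to a DataFrame, the Series name becomes the column label and the index name is carried over.  A sketch with made-up values (the `temps` series is hypothetical):

```python
import pandas as pd

temps = pd.Series([21.5, 19.0], index=['2020', '2021'])  # hypothetical data
temps.name = 'Mean Temp'
temps.index.name = 'Year'

# to_frame() turns the Series into a one-column DataFrame
df = temps.to_frame()
print(df)
print(list(df.columns))
print(df.index.name)
```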

### Exercise 2

Pandas Series objects have both <i>ndarray-like</i> and <i>dict-like</i> properties. Given the `co2_Emission` series object, do the following:

- Similar to the average of the series object, retrieve the maximum, median and cumulative sum of CO2 emissions between 1990 and 2013, the years covered by the series (use the `max()`, `median()` and `cumsum()` methods).


- Retrieve the CO2 emissions in Australia between 2000 and 2010.
- Given the population of Australia in 2013 is 23117353, retrieve the CO2 emission per capita for that year.
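
To illustrate the ndarray-like and dict-like behaviour mentioned above, here is a sketch on a small made-up series rather than the exercise data (NumPy is assumed to be available):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0], index=['a', 'b', 'c'])  # hypothetical data

# ndarray-like: NumPy functions work element-wise and keep the index
roots = np.sqrt(s)
print(roots)

# dict-like: test whether a label is present in the index
print('b' in s)
print('z' in s)

# dict-like: look up a label with a fallback value if it is missing
fallback = s.get('z', 0.0)
print(fallback)
```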



In [None]:
###answer here


## Recommended Reading:
[This article on Dataquest](https://www.dataquest.io/blog/jupyter-notebook-tutorial/) is an excellent introduction to Jupyter notebook.  If you haven't used Jupyter notebook before, I recommend familiarising yourself with it.

## Discussion questions 
- What is data science to you? 
- What makes it interesting? 
- What is meant by “Big Data”? What are its characteristics? 
- It has been claimed that data wrangling takes 80% of the time in a data science project, and everything else the remaining 20%. How can this be true? What specific activities make wrangling so time-consuming?
