# Data Visualisation 2020 - Notebook 0


## Introduction to the Jupyter environment

# $\S$ 1. Importing Libraries


* There are several libraries we will use frequently in these workbooks throughout the semester. The most frequent ones are 

    __(1)__ __numpy__ (numerical python)

    __(2)__ __matplotlib__ (for plotting)

    __(3)__ __pandas__ (for importing and manipulating data sets)

    __(4)__ __seaborn__ (nice data visualisation tools built on matplotlib)
    

* We may import each of these with a particular name (or abbreviation) of our choosing. In this workbook they will be imported as follows:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb

# $\S$ 2. Creating and Importing Data Files

* In many of the practicals we will obtain data files from various sources and use __python__ to perform various tasks with this data.


* We will also create our own data sets and import them into __python__


* In most cases, the data files we create will be of __.csv__ type (__csv__ = __c__omma __s__eparated __v__alues).


* These files are easily created in spreadsheet packages like __Excel__, __LibreOffice__ etc.


* To __import__ these files we use the _pandas_ command __pd.read_csv('some_file_name.csv')__


## Example 1


The cars in a car-park were counted by make, with the following data collected

|  Make  |  Number |
|--------|---------|
| Audi   |    3    |
| BMW    |    2    |
| Citroen|    5    |
| Ford   |    8    |
| Hyundai|    9    |
| Opel   |    6    |
| Toyota |    8    |
| VW     |    6    |




   **1.** Create a file to represent this data in __Excel__ and save it with the extension __.csv__. Save this csv file in the same directory as this notebook, to make things slightly easier.

   **2.** Import this data file as __data_eg1__ into __python__ using the command __pd.read_csv()__


### Solution


__1.__ The data is created in Excel in the obvious way. __Make sure__ you save this Excel file with the __.csv__ extension!!



__2.__ The data is imported using the Pandas command __read_csv()__ as follows:

In [2]:
data_eg1=pd.read_csv('Example1.csv')

* This has used the pandas (pd) command __read_csv()__ to import the data from the file __Example1.csv__ and call this data set __data_eg1__.

* To display this data set, simply call __data_eg1__ in a python cell, as follows

In [3]:
data_eg1

|   | Make    | Number |
|---|---------|--------|
| 0 | Audi    | 3      |
| 1 | BMW     | 2      |
| 2 | Citroen | 5      |
| 3 | Ford    | 8      |
| 4 | Hyundai | 9      |
| 5 | Opel    | 6      |
| 6 | Toyota  | 8      |
| 7 | VW      | 6      |
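If a spreadsheet package is not to hand, the same csv content could also be produced from Python itself. The following is only a sketch of that idea, not the method used in this workbook: it builds the car-park table as a pandas data frame, turns it into csv text with `to_csv`, and reads that text back with `pd.read_csv`. The names `cars`, `csv_text` and `cars_again` are made up for this illustration.

```python
import io
import pandas as pd

# Car-park counts copied from the table in Example 1.
cars = pd.DataFrame({
    "Make": ["Audi", "BMW", "Citroen", "Ford", "Hyundai", "Opel", "Toyota", "VW"],
    "Number": [3, 2, 5, 8, 9, 6, 8, 6],
})

# index=False leaves out the row labels, so the csv text holds only the two columns.
csv_text = cars.to_csv(index=False)
print(csv_text)

# Reading the text back stands in for pd.read_csv('some_file_name.csv').
cars_again = pd.read_csv(io.StringIO(csv_text))
print(cars_again)
```

To write an actual file instead of text, a file name can be passed as the first argument, for example `cars.to_csv('some_file_name.csv', index=False)`.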


## Selecting individual columns & rows

* We see that the data structure has two columns called __Make__ and __Number__. To display an individual column we use the following:

In [None]:
data_eg1['Make']

In [None]:
data_eg1['Number']

* On the very left of the data structure above, we see that each row is numbered from __0__ to __7__ (python always starts counting from zero, so be careful with this).

* We select an individual row using the command __.iloc__ (short for __i__nteger __loc__ation)

In [None]:
data_eg1.iloc[3]

* As we can see this gives the fourth row of the data frame.


* To show rows _m_ up to, but not including, _n_ we use __.iloc[m:n]__. In this example we will show data rows 0 to 2 and 5 to 6 as follows:

In [None]:
data_eg1.iloc[0:3]

In [None]:
data_eg1.iloc[5:7]

* We can also combine selections of rows and columns using __.iloc[rows, columns]__

In [None]:
data_eg1.iloc[5:7,0]

* Here we have shown the third-last and second-last rows  of the first column.
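The selections in this section can be collected into one self-contained sketch. The data frame is rebuilt by hand here (under the made-up name `cars`) so that the cell runs without the csv file; the comments record which rows each slice returns, bearing in mind that the end of a slice is not included.

```python
import pandas as pd

# Same car-park data as Example 1, rebuilt so this cell runs on its own.
cars = pd.DataFrame({
    "Make": ["Audi", "BMW", "Citroen", "Ford", "Hyundai", "Opel", "Toyota", "VW"],
    "Number": [3, 2, 5, 8, 9, 6, 8, 6],
})

fourth_row = cars.iloc[3]       # one row (position 3), returned as a Series
first_three = cars.iloc[0:3]    # rows 0, 1 and 2; row 3 is not included
rows_5_6 = cars.iloc[5:7]       # rows 5 and 6; row 7 is not included
makes_5_6 = cars.iloc[5:7, 0]   # rows 5 and 6 of column 0 (Make)

print(fourth_row)
print(first_three)
print(rows_5_6)
print(makes_5_6)
```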

## Exercise 1

* The recent closing price of shares in __Ryanair__ on the Irish Stock Exchange __(ISEQ)__ for 5 consecutive trading days is shown below.

__Ryanair Share Closing Price(€) ISEQ   (27 August 2020 - 2 September 2020)__

|Date      |Close |
|----------|------|
|02/09/2020|11.735|
|01/09/2020|11.775|
|31/08/2020|12.085|
|28/08/2020|12.675|
|27/08/2020|12.765|


__1.__ Create a __.csv__ file for this data.

__2.__ Import the data from this file into __python__.

__3.__ Display the data from the individual columns of this data structure.

__4.__ Display the data from the first two rows of the second column.

# $\S$ 3. Python functions acting on data sets

* The main aim of this course is to find the best way to represent the information in a data set. For that reason, there will be a certain amount of mathematical work involved in this course, which we will cover using simple examples and simple data sets during lectures.


* While the mathematics involved is not particularly difficult, when we use real-world data sets (which tend to be large), the mathematics can become cumbersome and tedious, and practically impossible to complete by hand.


* __Python__ can automate a huge amount of this work for us for these larger data sets, and during the practicals we will try to apply some of the maths we learn during lectures to much larger (and more realistic) data sets.


* Some of the familiar mathematical functions we will be applying to data sets will be
  
    * __Mean__ (i.e.  __Average__)
    
    * __Median__ (i.e. __Middle__)
    
    * __Standard Deviation__ (i.e. __Spread__) 
    
    

* There will be others we encounter as we proceed, and they will be introduced in lectures and in these practicals during the semester.
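As a small illustration of the three functions listed above, the sketch below works each one out "by hand" and compares the result with the corresponding numpy function. The five-value data set is hypothetical, chosen only so that the arithmetic is easy to check, and the standard deviation shown is the version that divides by the number of data points (the numpy default).

```python
import numpy as np

# Hypothetical five-value data set, chosen so the arithmetic is easy to follow.
data = np.array([3, 1, 4, 1, 5])
n = len(data)

# Mean: add the values and divide by how many there are, (3+1+4+1+5)/5 = 2.8
mean_by_hand = data.sum() / n

# Median: the middle value of the sorted data 1, 1, 3, 4, 5 (n is odd here)
middle_value = np.sort(data)[n // 2]

# Standard deviation: square root of the average squared distance from the mean
std_by_hand = np.sqrt(((data - mean_by_hand) ** 2).sum() / n)

print(mean_by_hand, np.mean(data))
print(middle_value, np.median(data))
print(std_by_hand, np.std(data))
```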

## Example 2

* Create a __numpy array__ to represent the data set


$$S=\{1,5,-32,1,1,4,33,6,-6,10,12,-15,22,3,3,-4,18,-19,2,-2,2,1\}$$

* Using this data structure answer the following:

    __1.__ Find the __mean__ of $S$.
  
   __2.__ Find the __median__ of $S$.
  
   __3.__ Find the __standard deviation__ of $S^2$, where $S^2$ means each element of $S$ should be squared individually.
  
   __4.__ Find the __mean__ of $S^2+3S$.
  
   __5.__ Find the __median__ of $4S^2+S+4$.
  
   __6.__ Sort the data.
  
   __7.__ Find the number of data points.
  
   __8.__ Find the  sum of the data values.
  
   __9.__ Divide the result of __8.__ by the result from __7.__ and compare with the value in __1.__
  
   __10.__ Does the result in __2.__ evenly split the ordered data set in __6.__?

### Solution

* We create the data array using the numpy command __array([])__ as follows (note the double brackets!!)

In [None]:
S=np.array([1,5,-32,1,1,4,33,6,-6,10,12,-15,22,3,3,-4,18,-19,2,-2,2,1])
S

__1.__ The mean is found using __np.mean()__

In [None]:
np.mean(S)

__2.__ The median is found using __np.median()__

In [None]:
np.median(S)

__3.__ In Python we use __*__ to denote multiplication and __**__ to denote a power. So $S^2$ is written in Python as 

In [None]:
S**2

* We see each individual element of $S$ has been squared.


* The __standard deviation__ of the "squared set" is found using __np.std()__

In [None]:
np.std(S**2)
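One detail worth being aware of: by default __np.std()__ divides by the number of data points $n$ (the "population" standard deviation), whereas the pandas method `.std()` divides by $n-1$ by default (the "sample" standard deviation). Which convention the lectures use is not stated here, so the following is only a sketch of the difference, using a hypothetical eight-value data set.

```python
import numpy as np
import pandas as pd

# Hypothetical data set with mean 5 and squared deviations summing to 32.
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

population_sd = np.std(values)        # divides by n:     sqrt(32/8) = 2.0
sample_sd = np.std(values, ddof=1)    # divides by n - 1: sqrt(32/7), about 2.138
pandas_sd = pd.Series(values).std()   # the pandas default matches ddof=1

print(population_sd, sample_sd, pandas_sd)
```

For large data sets the two versions are close, but for a handful of values the gap is noticeable, so it helps to know which one a given command returns.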

__4.__ This is just a combination of the results of the other three parts:

In [None]:
np.mean(S**2+3*S)
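An expression such as `S**2+3*S` is evaluated element by element: each value is squared, each value is multiplied by 3, and the two results are added position by position. A minimal sketch with a hypothetical three-value array `a` makes the intermediate steps visible.

```python
import numpy as np

a = np.array([1, -2, 3])    # hypothetical three-value array

squared = a ** 2            # [1, 4, 9]
tripled = 3 * a             # [3, -6, 9]
combined = a ** 2 + 3 * a   # [4, -2, 18], added position by position

print(squared)
print(tripled)
print(combined)
print(np.mean(combined))    # (4 - 2 + 18) / 3
```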

__5.__ Again, this is a simple combination of the other functions we have encountered:

In [None]:
np.median(4*S**2+S+4)

__6.__ We use the __np.sort()__ function to arrange the set $S$ in order of increasing values

In [None]:
np.sort(S)

__7.__ We use __len()__ to count the number of data values in $S$ (i.e. it gives the __length__ of $S$)

In [None]:
len(S)

__8.__ We use __sum()__ to add all the values in $S$

In [None]:
sum(S)

__9.__ Carrying out the instructions we find

In [None]:
sum(S)/len(S)

* This agrees with the mean found in part __1.__, as it should!

__10.__ In the sorted data set in part __6.__, we see that ten of the data values are less than 2 and ten are greater than 2, while the two middle values are both equal to 2, so 2 is indeed the median of the data set.
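Rather than counting by eye, the split can also be checked in code. The sketch below counts how many values of $S$ lie below, at, and above the median; comparing an array with a number gives an array of True/False values, and `np.sum` counts the True entries. The names `med`, `below`, `equal` and `above` are chosen just for this illustration.

```python
import numpy as np

S = np.array([1, 5, -32, 1, 1, 4, 33, 6, -6, 10, 12, -15, 22, 3, 3, -4, 18, -19, 2, -2, 2, 1])
med = np.median(S)

below = np.sum(S < med)    # number of values strictly less than the median
equal = np.sum(S == med)   # number of values equal to the median
above = np.sum(S > med)    # number of values strictly greater than the median

print(med, below, equal, above)
```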

### Exercise 3

The historical price data of shares in __Amazon__ on the NASDAQ Stock Exchange, from __3 August 2020 - 1 September 2020__, is available at 

[__Amazon NASDAQ Price__](https://www.nasdaq.com/symbol/amzn/historical), 

and can be downloaded as a __.csv__ file from __Moodle->Data Visualisation->Data Files->AMZN.csv__.



* Extract the __Close__, __Open__, __High__ and __Low__ data and answer the following:


   __1.__ Find the mean closing price of the shares over this period, i.e. find the mean of __Close__


   __2.__ Find the median difference between the opening and closing values, i.e. the median of ( __Close__ - __Open__ )


   __3.__ Find the standard deviation of __High__ 


   __4.__ Find the mean, median and standard deviation of ( __High__ - __Low__ )

### Exercise 4

* Historical data of shares in the chip maker __AMD__ (__A__dvanced __M__icro __D__evices) on the __NASDAQ Stock Exchange__ is available at 

    [__AMD NASDAQ Data__](https://www.nasdaq.com/market-activity/stocks/amd/historical) 


* Download the historical data for the past month from this website. Import this data into __python__ as __AMD__ and repeat the steps of __Exercise 3__ for this data structure.