# Introduction to Python - Lecture 09 (31 Oct 2018)

### Agenda for today:
+ Introduction to Pandas
+ Introduction to Seaborn

#### Recap

In [None]:
import numpy as np

lst = np.array(
      [[ 1,  1],
       [ 4,  3],
       [ 0,  1],
       [-1,  1],
       [ 0,  1],
       [ 4, -2],
       [-5,  1],
       [-1,  0],
       [-3,  3],
       [ 3,  3]])

#### How do you return the first column? [1, 4, 0, -1, 0, 4, -5, -1, -3, 3]

#### How do you return the second column? [1, 3, 1, 1, 1, -2, 1, 0, 3, 3]

#### How do you get the sum of each row?

#### How can we calculate the z score of each row?
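
One possible set of answers to the recap questions, as a sketch. The variable names (`first_col`, `row_sums`, `z_scores`) are illustrative. The z-score question can be read in more than one way; this sketch assumes it means standardising each value against the mean and standard deviation of its column, using NumPy's default population standard deviation (`ddof=0`).

```python
import numpy as np

lst = np.array(
      [[ 1,  1],
       [ 4,  3],
       [ 0,  1],
       [-1,  1],
       [ 0,  1],
       [ 4, -2],
       [-5,  1],
       [-1,  0],
       [-3,  3],
       [ 3,  3]])

# A column is selected with [:, column_index]
first_col = lst[:, 0]
second_col = lst[:, 1]

# axis=1 sums across the columns, giving one value per row
row_sums = lst.sum(axis=1)

# Assumed interpretation: z-score of every value relative to its column
z_scores = (lst - lst.mean(axis=0)) / lst.std(axis=0)

print(first_col)
print(second_col)
print(row_sums)
print(z_scores)
```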

---

## Pandas

Pandas is an external library, like NumPy and seaborn, and needs to be installed with a package manager.

Anaconda:
+ `conda install pandas`

Pip:
+ `pip install pandas`

**Note** Seaborn requires pandas as a dependency, so if you have seaborn installed you should already have pandas.

To use pandas we first need to import it. By convention it is imported under the alias `pd`.

In [7]:
import numpy as np
import pandas as pd

#### What is a dataframe?

A dataframe is a two-dimensional, labelled table of data. Each row usually holds one observation, and each column holds one variable measured for every observation.
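
As a small illustration of that layout, the sketch below builds a dataframe by hand and inspects its shape and labels. The values are the first three rows of the trees dataset used later in this lecture; the lower-case column names are chosen here for illustration.

```python
import pandas as pd

# Each key becomes a column; each position in the lists becomes a row
df = pd.DataFrame({
    "girth": [8.3, 8.6, 8.8],
    "height": [70, 65, 63],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(list(df.index))    # row labels (integers by default)
```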

#### Creating a dataframe

There are many ways to create dataframes:
+ Converting a dictionary to a dataframe
    ```python
df = pd.DataFrame.from_dict( << dict >> )
    ```
+ Loading the data from a csv file
    ```python
df = pd.read_csv( << csv_path >> )
    ```
+ Loading the data from a URL
    ```python
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/trees.csv"
df = pd.read_csv(url)
    ```

#### Loading data from a dictionary

Some of these examples are taken from the pandas documentation (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.from_dict.html#pandas.DataFrame.from_dict)

###### By default, each item in the dictionary will represent a column

In [17]:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data)

Unnamed: 0,col_1,col_2
0,3,a
1,2,b
2,1,c
3,0,d


###### This can be changed by setting the orient parameter to 'index' (the default is 'columns')

In [18]:
data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
pd.DataFrame.from_dict(data, orient='index')

Unnamed: 0,0,1,2,3
row_1,3,2,1,0
row_2,a,b,c,d


###### The names of the columns can be set using the columns parameter

In [19]:
pd.DataFrame.from_dict(data, orient='index',
                        columns=['A', 'B', 'C', 'D'])

Unnamed: 0,A,B,C,D
row_1,3,2,1,0
row_2,a,b,c,d


###### Alternatively you can specify the column names in the dictionary

In [21]:
data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
    'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}
pd.DataFrame.from_dict(data, orient='index')

Unnamed: 0,girth,height,volume
Tree1,8.3,70,10.3
Tree2,8.6,65,10.3
Tree3,8.8,63,10.2


**Note** Each row has an identifier, called its index label. In the above example this is '**Tree#**'. If no labels are given, pandas uses the integers from 0 to n-1.
+ In the above example we can replace the index with integers using the `reset_index()` method. The old labels are kept in a new column named `index`.

In [25]:
pd.DataFrame.from_dict(data, orient='index').reset_index()

Unnamed: 0,index,girth,height,volume
0,Tree1,8.3,70,10.3
1,Tree2,8.6,65,10.3
2,Tree3,8.8,63,10.2
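
Two common variations on `reset_index()` are sketched below, using the same `data` dictionary as above. The column name `tree` is an illustrative choice, not something fixed by pandas.

```python
import pandas as pd

data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
    'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}
df = pd.DataFrame.from_dict(data, orient='index')

# Keep the old labels, but give the new column a more descriptive name
renamed = df.reset_index().rename(columns={'index': 'tree'})

# Discard the old labels entirely instead of keeping them as a column
dropped = df.reset_index(drop=True)

print(renamed)
print(dropped)
```

Note that `reset_index()` returns a new dataframe; `df` itself is left unchanged.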


###### Loading Data from a URL

In [31]:
url = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/trees.csv"
df = pd.read_csv(url)
df

Unnamed: 0.1,Unnamed: 0,Girth,Height,Volume
0,1,8.3,70,10.3
1,2,8.6,65,10.3
2,3,8.8,63,10.2
3,4,10.5,72,16.4
4,5,10.7,81,18.8
5,6,10.8,83,19.7
6,7,11.0,66,15.6
7,8,11.0,75,18.2
8,9,11.1,80,22.6
9,10,11.2,75,19.9
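
The `Unnamed: 0` column above appears because the first column of the file has no header, so pandas invents a name for it. One way to avoid this is the `index_col` parameter of `read_csv`, which uses a column of the file as the index. The sketch below uses a small in-memory string (`csv_text`, a stand-in assumed to mirror the layout of the file) instead of the URL, so that it runs without a network connection.

```python
import io
import pandas as pd

# Stand-in for the downloaded file: the first column has an empty header
csv_text = """,Girth,Height,Volume
1,8.3,70,10.3
2,8.6,65,10.3
3,8.8,63,10.2
"""

# Default behaviour: the unnamed column is loaded as ordinary data
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column of the file as the row labels
trees = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(trees.columns.tolist())
print(trees.index.tolist())
```

The same `index_col=0` argument can be passed when reading from the URL.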


### Accessing values in the dataframe

#### Columns

Access columns using the column name in square brackets to return a series containing the data.
```python
df["column_name"]
```

+ A Series is a 1D labelled array: each value is paired with an index label

In [30]:
df = pd.DataFrame.from_dict(data, orient='index')

df["girth"]

Tree1    8.3
Tree2    8.6
Tree3    8.8
Name: girth, dtype: float64
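
Because a Series pairs each value with a label, a single value can be looked up either by label or by position. A minimal sketch, reusing the `data` dictionary from above (the variable name `girth` is illustrative):

```python
import pandas as pd

data = {
    'Tree1': {'girth': 8.3, 'height': 70, 'volume': 10.3},
    'Tree2': {'girth': 8.6, 'height': 65, 'volume': 10.3},
    'Tree3': {'girth': 8.8, 'height': 63, 'volume': 10.2}
}
df = pd.DataFrame.from_dict(data, orient='index')
girth = df["girth"]

print(girth.index.tolist())  # the labels
print(girth.to_numpy())      # the values as a NumPy array
print(girth.loc["Tree2"])    # look up by label
print(girth.iloc[0])         # look up by position
```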



---

## Seaborn



In [4]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline