## Introduction
Pandas is a Python library used to work with data sets. A data set is a collection of related information. For example, a list of sales figures can be considered a data set.

Pandas allows us to manipulate, analyse, explore and clean a data set. Cleaning is essential to data analysis because it improves data quality, produces more accurate results, and makes the data easier to work with.

**Please note**, this notebook assumes you have read through the [IntroToPython](../1_IntroToPython/Python3CrashCourse.ipynb) notebook. If you have not done so, please do so before proceeding.

## Series
A series is a one-dimensional data structure within Pandas and can be thought of as a single column in a table. It is made up of key-value pairs, where each key is an index label and each value is the data stored at that index. Look at the table below for an example:


Demonstrating the key-value pairing in a series:

<table><caption></caption><thead><tr><th>Key</th><th>Value</th></tr></thead><tbody><tr><td>0</td><td>Train</td></tr><tr><td>1</td><td>Plane</td></tr><tr><td>2</td><td>Car</td></tr></tbody></table>

Let's demonstrate this in Python. First, we need to import the Pandas library into our program. To make it easier in the future, I'm going to import Pandas and reference it as `pd`:

In [1]:
import pandas as pd

Now, let's create a variable to contain our data, which is a Python list. The data in this case is modes of transportation.

In [2]:
transportation = ['Train', 'Plane', 'Car']

print(transportation)

['Train', 'Plane', 'Car']


To turn this into a series, let's:
1. Create a new variable `transportation_series`
2. Invoke Pandas' Series constructor `pd.Series`
3. Provide the variable `transportation` to it.
4. Print out the series `print(transportation_series)`

In [3]:
# Remember! We have already imported Pandas as pd in a previous step
transportation_series = pd.Series(transportation)

print(transportation_series)

0    Train
1    Plane
2      Car
dtype: object


Note the `dtype:` line in the output. This is the data type of the values within the series. For now, here are two common examples:
- strings (words) = `object`
- integers (whole numbers) = `int64`
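
To make the key-value pairing and the `dtype` more concrete, here is a minimal sketch. The `journeys` data and its numbers are hypothetical and only serve as an illustration. It assumes the `index` argument of `pd.Series` is used to supply custom keys instead of the default 0, 1, 2. The exact dtype name printed for text can differ between Pandas versions, so treat the printed names as indicative.

```python
import pandas as pd

# Hypothetical example data: journey counts keyed by mode of transport
journeys = pd.Series([12, 3, 45], index=['Train', 'Plane', 'Car'])

# Look up a value by its key (the index label)
print(journeys['Plane'])

# Whole numbers are stored with an integer dtype (typically int64)
print(journeys.dtype)

# Text values are not numeric; many Pandas versions report them as object
words = pd.Series(['Train', 'Plane', 'Car'])
print(words.dtype)
```

Using labels as keys is a design choice: the default numeric index is fine for simple lists, while labels can make lookups easier to read.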

## DataFrames
DataFrames build on series. In fact, a DataFrame is a grouping of series and can be thought of as a spreadsheet or database table because it is two-dimensional (it has rows and columns). DataFrames are a crucial concept to understand. In this scenario, we are going to load the following data into a DataFrame:
1. Name
2. Age
3. Country of Residence

<table><thead><tr><th>Name</th><th>Age</th><th>Country of Residence</th></tr></thead><tbody><tr><td>Ben</td><td>24</td><td>United Kingdom</td></tr><tr><td>Jacob</td><td>32</td><td>United States of America</td></tr><tr><td>Alice</td><td>19</td><td>Germany</td></tr></tbody></table>

Let's proceed!

In [4]:
# Remember! We have already imported Pandas as pd in a previous step

# Creating a two-dimensional list (remember - rows and columns!)

data = [['Ben', 24, 'United Kingdom'],
        ['Jacob', 32, 'United States of America'],
        ['Alice', 19, 'Germany']]


# Now we create a new variable (df) to store the DataFrame using the list from above.
# We need to specify the columns in the same order as the values in each row. For example:
# 1. Name = Ben
# 2. Age = 24
# 3. Country of Residence = United Kingdom

df = pd.DataFrame(data, columns=['Name', 'Age', 'Country of Residence'])

# Now let's display the DataFrame (df)
df

    Name  Age      Country of Residence
0    Ben   24            United Kingdom
1  Jacob   32  United States of America
2  Alice   19                   Germany
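
Because a DataFrame is a grouping of series, each column can be pulled back out as a series. The following is a small sketch of that idea using the same data as above. The variable name `ages` is introduced here purely for illustration.

```python
import pandas as pd

data = [['Ben', 24, 'United Kingdom'],
        ['Jacob', 32, 'United States of America'],
        ['Alice', 19, 'Germany']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Country of Residence'])

# Selecting one column by name returns a Series
ages = df['Age']
print(type(ages))
print(ages)

# The Series keeps the DataFrame's row index (0, 1, 2) as its keys
print(list(ages.index))
```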


Okay! That's a great start. Pandas allows us to do all sorts of manipulation. For example, to return only a specific row, we can use Pandas' `loc` with the index label of that row. In this demonstration, I want to return row #2. Its index will be 1 (recall that indices start from 0):

In [5]:
# Remember! This is row #2
df.loc[1]

Name                                       Jacob
Age                                           32
Country of Residence    United States of America
Name: 1, dtype: object
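
It may help to see that `loc` looks rows up by their index label rather than by their position. With the default index the two happen to match, which is why `df.loc[1]` returns the second row. The sketch below assumes we relabel the rows with `set_index` (the name `by_name` is hypothetical) and compares `loc` with the position-based `iloc`.

```python
import pandas as pd

data = [['Ben', 24, 'United Kingdom'],
        ['Jacob', 32, 'United States of America'],
        ['Alice', 19, 'Germany']]
df = pd.DataFrame(data, columns=['Name', 'Age', 'Country of Residence'])

# Use the Name column as the row labels instead of 0, 1, 2
by_name = df.set_index('Name')

# loc looks rows up by label
jacob = by_name.loc['Jacob']
print(jacob)

# iloc looks rows up by position (0-based), whatever the labels are
second_row = by_name.iloc[1]
print(second_row)

# loc also accepts a column label to pick out a single value
print(by_name.loc['Alice', 'Age'])
```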

## Grouping
Grouping is an essential concept in data analysis. It is an operation in Pandas that allows us to split our data into categories and do something with each category, such as:
- Grouping columns
- Grouping rows
- Comparing

In the example below, the company has hosted an awards night. We can use Pandas' `groupby` to group the rows by "Department", select the "Prize" column, and use `sum` to see how many awards each department has won:

In [6]:
# Load awards.csv as a dataframe
df = pd.read_csv("awards.csv")
df

   Employee  Department  Prize
0       Ben          IT      1
1     James    Accounts      1
2      Elis     Support      0
3  Mohammad          IT      1


In [7]:
# Group the rows by "Department" and select the "Prize" column
df.groupby(['Department'])['Prize'].sum() # Use sum to add up the Prize values within each department

Department
Accounts    1
IT          2
Support     0
Name: Prize, dtype: int64

In [8]:
# Group the rows by "Department" and select the "Prize" column
df.groupby(['Department'])['Prize'].describe() # Use describe for summary statistics per department (count, mean, std, min, quartiles, max)

            count  mean  std  min  25%  50%  75%  max
Department
Accounts      1.0   1.0  NaN  1.0  1.0  1.0  1.0  1.0
IT            2.0   1.0  0.0  1.0  1.0  1.0  1.0  1.0
Support       1.0   0.0  NaN  0.0  0.0  0.0  0.0  0.0
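
If `awards.csv` is not to hand, the grouping can still be tried out. The sketch below assumes an inline copy of the four rows shown earlier stands in for the file (the variable names `awards` and `summary` are hypothetical). It also uses `agg` to request several aggregations at once, which is one possible alternative to calling `sum` and `describe` separately.

```python
import pandas as pd

# Inline copy of the rows shown earlier, standing in for awards.csv
awards = pd.DataFrame({
    'Employee': ['Ben', 'James', 'Elis', 'Mohammad'],
    'Department': ['IT', 'Accounts', 'Support', 'IT'],
    'Prize': [1, 1, 0, 1],
})

# Several aggregations of Prize per department in one call:
# sum = awards won, count = employees listed, mean = share who won
summary = awards.groupby('Department')['Prize'].agg(['sum', 'count', 'mean'])
print(summary)
```

Because "Prize" holds 1 for a win and 0 otherwise, the `sum` column gives the number of awards per department.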
