# AIDM7330 Basic Programming for Data Science
# Tutorial: Pandas
## What is Pandas? 

The name pandas is derived from “panel data” and is also a play on “Python data analysis”.  
It is a "flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more." (Pandas GitHub)

## Import Pandas

In [None]:
# import packages
import pandas as pd

# Extra packages
import numpy as np

**Key Points:** Pandas has two main data types:
* Series (a cross between a list and a dictionary)
* DataFrames (table or spreadsheet with Series in the columns) [important!]

Older versions also had Panels (a 3D version of DataFrame), but these were uncommon and have been removed from current pandas releases.

## The Series Data Structure
<img src="../figs/series.png" alt="drawing" width="450"/>

In [None]:
pd.Series?

### Create a series
You can create a series by passing in a list of values.
- Pandas automatically assigns an index starting with zero.

In [None]:
animals = ['Tiger', 'Bear', 'Moose']
s1 = pd.Series(animals)
s1

We can retrieve an element by its index.

In [None]:
s1[2]

If we pass in a list of whole numbers, pandas sets the type to int64.

In [None]:
numbers = [1, 2, 3]
s2 = pd.Series(numbers)
s2

A series can be created from dictionary data. 
- If you do this, the index is automatically assigned to the keys of the dictionary that you provided and not just incrementing integers.

In [None]:
purse = {'money': 12,
          'candy': 3,
          'tissues': 75}
s3 = pd.Series(purse)
s3

In [None]:
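# Positional lookup with [] on a label-indexed Series is deprecated in recent pandas versions;
# s3.iloc[2] (used further down) is the reliable way to get the third element.
s3[2]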

In [None]:
s3.index

In [None]:
s1.index
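To make the difference between the two indexes concrete, here is a minimal sketch that rebuilds the same data under new variable names (`from_list` and `from_dict` are illustrative names, not part of the notebook) and compares what pandas assigns as the index in each case.

```python
import pandas as pd

# A list gives a default integer index that starts at 0.
from_list = pd.Series(['Tiger', 'Bear', 'Moose'])

# A dictionary uses its keys as the index labels.
from_dict = pd.Series({'money': 12, 'candy': 3, 'tissues': 75})

print(list(from_list.index))   # integer positions
print(list(from_dict.index))   # dictionary keys
```
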

**Access data by .iloc**
- .iloc[] is primarily <ins>integer position</ins> based (from 0 to length-1 of the axis), but may also be used with a boolean array.

In [None]:
s3.iloc[2]

**Access data by .loc**
- To query by the index label, you can use the loc attribute. 

In [None]:
s3.loc['candy']
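The following sketch puts the two accessors side by side on the same purse data, and also shows the boolean-array form mentioned above. The variable names are chosen for illustration only.

```python
import pandas as pd

s = pd.Series({'money': 12, 'candy': 3, 'tissues': 75})

by_position = s.iloc[2]     # third element, counting from 0
by_label = s.loc['candy']   # element whose index label is 'candy'
by_mask = s.loc[s > 10]     # boolean mask keeps only entries greater than 10

print(by_position, by_label)
print(by_mask)
```
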

## The DataFrame Data Structure
Pandas has an object called a **DataFrame**, which is **like a table**. \
<img src="../figs/dataframe.png" alt="drawing" width="450"/>
We use `pd.DataFrame(inputs)` and can pass in almost any kind of data (for example lists, dictionaries, Series, or arrays) as the argument.

In [None]:
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df

Extract data using `iloc` and `loc`.

In [None]:
df.iloc[1]

In [None]:
df.loc['Store 2']

Check the data type of the return

In [None]:
type(df.loc['Store 2'])

The index labels and column names along either axis, horizontal or vertical, could be non-unique.  
In this example, there are two purchase records for Store 1 in different rows. If we pass a single label to the DataFrame `loc` attribute, all matching rows of the DataFrame are returned.

In [None]:
df.loc['Store 1']

If you wanted to just list the costs for Store 1, you would supply *two parameters* to `.loc`, one being the row index and the other being the column name.

In [None]:
df.loc['Store 1', 'Cost']
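As a compact illustration of how the return type depends on the labels, the sketch below rebuilds the same purchase table from a dictionary of columns (a different construction from the one above, chosen only to keep the example short) and checks what `.loc` returns in each case.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Chris', 'Kevyn', 'Vinod'],
     'Item Purchased': ['Dog Food', 'Kitty Litter', 'Bird Seed'],
     'Cost': [22.50, 2.50, 5.00]},
    index=['Store 1', 'Store 1', 'Store 2'])

unique_label = df.loc['Store 2']            # one matching row: a Series
repeated_label = df.loc['Store 1']          # two matching rows: a DataFrame
store_1_costs = df.loc['Store 1', 'Cost']   # row label plus column name: a Series of costs

print(type(unique_label).__name__)
print(type(repeated_label).__name__)
print(store_1_costs.tolist())
```
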

**Column selection**

In [None]:
df['Cost']

- Select multiple columns  
To include multiple columns, we pass them in a list. Pandas will then bring back only the columns we have asked for.

In [None]:
df.loc[:,['Name', 'Cost']]

- Select multiple consecutive columns

In [None]:
df.loc[:,'Name':'Cost']

### <span style="color:DARKTURQUOISE">Note</span>  
`df[a:b]` or `df.loc[a:b]`: slice rows  
`df[col]`, `df[[col]]`, `df[[col1, col2]]` or `df.loc[:, [col1, col2]]`: select columns  
`df.loc[a:b, x:y]`: slice by index labels and column names (both endpoints are included)  
`df.iloc[3:5,0:2]`: slice BY POSITION in the table (the stop position is excluded)  
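The sketch below compares label slicing and position slicing on a small made-up table (the labels `a` to `d` and columns `x`, `y`, `z` are hypothetical). It is meant to show one practical difference: `.loc` slices include both endpoints, while `.iloc` slices stop before the end position.

```python
import pandas as pd

df = pd.DataFrame(
    {'x': [1, 2, 3, 4], 'y': [10, 20, 30, 40], 'z': [100, 200, 300, 400]},
    index=['a', 'b', 'c', 'd'])

by_label = df.loc['b':'c', 'x':'y']   # label slices include both endpoints
by_position = df.iloc[1:3, 0:2]       # position slices exclude the stop value

print(by_label)
print(by_label.equals(by_position))   # both select rows b, c and columns x, y
```
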
### Drop data  
Use the `drop` function to drop data. Its first parameter is the label (or list of labels) to remove, and it drops those labels from the rows or columns.  

In [None]:
df.drop('Store 1')

In [None]:
df

Nothing happened, why?  
- The drop function doesn't change the DataFrame by default. Instead, it returns a copy of the DataFrame with the given rows removed.
- By default, the drop function drops rows (`axis=0`)

Drop rows by index label (every row labelled `Store 1` is removed)

In [None]:
copy_df = df.copy()
copy_df = copy_df.drop(['Store 1'])
copy_df

In [None]:
copy_df.drop?

- Drop a column  
axis : {0 or 'index', 1 or 'columns'}, default 0  
inplace : bool, default False. If True, do operation inplace and return None.

In [None]:
copy_df = df.copy()
copy_df.drop(['Cost'], axis=1, inplace=True)
copy_df
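To summarise the behaviour described above, here is a small sketch on a cut-down version of the purchase table. It uses the `columns=` keyword of `drop`, an alternative spelling of `axis=1` that does not appear in the cells above.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Chris', 'Kevyn', 'Vinod'], 'Cost': [22.50, 2.50, 5.00]},
    index=['Store 1', 'Store 1', 'Store 2'])

without_store_1 = df.drop('Store 1')       # returns a new frame; df is unchanged
without_cost = df.drop(columns=['Cost'])   # same effect as df.drop(['Cost'], axis=1)

print(df.shape)                            # the original still has all rows and columns
print(without_store_1.index.tolist())
print(without_cost.columns.tolist())
```

Reassigning the result (`df = df.drop(...)`) is a common alternative to `inplace=True` and makes it explicit which frame has been changed.
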

Add a new column with default value of np.nan

In [None]:
df['Location'] = np.nan
df

### Modify data

In [None]:
costs = df['Cost']
costs

In [None]:
costs = costs+2
costs

In [None]:
df

In [None]:
df['Cost']+2

In [None]:
df

In [None]:
df['Cost'] = df['Cost']+2

In [None]:
df
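The cells above can be condensed into one sketch: arithmetic on a column produces a new Series, and the DataFrame only changes once that result is assigned back. The names `raised` and `before` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Cost': [22.50, 2.50, 5.00]},
                  index=['Store 1', 'Store 1', 'Store 2'])

raised = df['Cost'] + 2          # a new Series; df is not modified
before = df['Cost'].tolist()     # still the original costs

df['Cost'] = df['Cost'] + 2      # assigning back updates the column
after = df['Cost'].tolist()

print(before)
print(after)
```
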

## Dataframe loading
Pandas has built-in support for loading delimited files such as CSV files.  
The `read_csv` function reads a comma-separated values (csv) file into a DataFrame.  
`olympics.csv` has data from Wikipedia that contains a summary list of the medals various countries have won at the Olympics.

In [None]:
df = pd.read_csv('../data/olympics.csv')
df.head()

We see that the first cell has a NaN (i.e., not a number) in it since it's an empty value, and the rows have been automatically indexed for us.
- We usually use `np.nan` to represent missing values

We can use the `index_col` parameter to indicate which column should be the index, and the `skiprows` parameter to skip leading rows so that the header is read from a later row of the data file.
- `index_col = 0` sets the 1st column as the index
- `skiprows = 1` tells Pandas to ignore the first row, so the column headers are read from the second row of the file.

In [None]:
df = pd.read_csv('../data/olympics.csv', index_col = 0, skiprows = 1)
df.head()

Pandas recognizes duplicate column names and appends a suffix such as *.1* or *.2* (giving *Gold.1*, *Gold.2*, etc.) so that every column label is **unique**.  
`df.columns`: The column labels of the DataFrame.

In [None]:
df.columns
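Since `olympics.csv` may not be at hand, the sketch below reproduces the same loading steps on a tiny in-memory CSV. The country names and medal counts are made up for illustration and the layout only imitates the real file: a grouping row first, then a header row with repeated names.

```python
import io
import pandas as pd

# Made-up data: line 1 is a grouping row, line 2 holds the real column headers.
raw = """,Summer,Summer,Winter,Winter
,Gold,Silver,Gold,Silver
Aland,1,2,0,3
Borduria,0,0,2,1
"""

# Skip the grouping row and use the first column as the index.
medals = pd.read_csv(io.StringIO(raw), index_col=0, skiprows=1)

print(medals.columns.tolist())   # repeated header names get numeric suffixes
print(medals.index.tolist())
```
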

## Attributes & general statistics of a Pandas DataFrame

The dataframe consists of 147 rows and 15 columns.

In [None]:
df.shape

`df.size` returns number of elements in dataframe.

In [None]:
df.size # 147*15=2205

In [None]:
df.shape[0] # number of rows

In [None]:
df.shape[1] # number of columns

The method `df.describe()` shows some general statistics.

In [None]:
df.describe()

The last row in df contains the totals of each column, so it should be excluded from the statistics.

In [None]:
df.tail()

In [None]:
df_ex = df.iloc[:-1] # exclude the last row
df_ex.describe()

Let's check which country has won the most gold medals at the Summer Olympics.

In [None]:
df_ex['Gold'].idxmax()

You can also show only selected statistics for selected columns.

In [None]:
df_ex.describe().loc[['mean','std'],['Gold','Silver']]
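The sketch below walks through the same attributes and methods on a small made-up medal table (the country names and counts are hypothetical). It also shows why the totals row has to be excluded: left in, it would win every `idxmax` and distort the mean.

```python
import pandas as pd

medals = pd.DataFrame(
    {'Gold': [3, 0, 5, 8], 'Silver': [1, 2, 4, 7]},
    index=['Aland', 'Borduria', 'Cascadia', 'Totals'])

n_rows, n_cols = medals.shape     # (rows, columns)
n_cells = medals.size             # rows * columns

countries = medals.iloc[:-1]      # drop the last row, which holds the totals
top_gold = countries['Gold'].idxmax()   # index label of the largest Gold value
summary = countries.describe().loc[['mean', 'std'], ['Gold', 'Silver']]

print(n_rows, n_cols, n_cells)
print(top_gold)
print(summary)
```
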

## Querying a DataFrame
- Boolean mask  
Boolean masks are created by applying operators directly to the pandas Series or DataFrame objects. Any cell aligned with a True value will be admitted into our final result, and any cell aligned with a False value will not.
<img src="../figs/boolean_mask.png" alt="drawing" width="450"/>

- Find only those countries who have achieved at least one gold medal at the summer Olympics. 

In [None]:
df['Gold'] > 0

In [None]:
df['Gold'][df['Gold']>0]

`Pandas` allows the indexing operator to take a Boolean mask as a value instead of just a list of column names. The syntax might look a little messy, but the result is that you're able to filter and reduce data frames relatively quickly.

In [None]:
only_gold = df[df['Gold'] > 0]
only_gold.head()

In [None]:
only_gold.shape[0]

- Find all countries who have received golds in the summer Olympics or golds in the winter Olympics.  
`|`: OR  
`&`: AND

In [None]:
df[(df['Gold'] > 0) | (df['Gold.1'] > 0)]

In [None]:
len(df[(df['Gold'] > 0) | (df['Gold.1'] > 0)])

- Find all countries that received gold only in the Winter Olympics and no gold in the Summer Olympics.

In [None]:
df[(df['Gold.1'] > 0) & (df['Gold'] == 0)]
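Here is a self-contained sketch of the same queries on a made-up three-country table (names and counts are hypothetical). Each comparison is wrapped in parentheses because `|` and `&` bind more tightly than `>` and `==` in Python.

```python
import pandas as pd

medals = pd.DataFrame(
    {'Gold': [3, 0, 0], 'Gold.1': [0, 2, 0]},
    index=['Aland', 'Borduria', 'Cascadia'])

summer = medals['Gold'] > 0      # boolean Series, one value per row
winter = medals['Gold.1'] > 0

either = medals[summer | winter]                        # gold in summer OR winter
winter_only = medals[winter & (medals['Gold'] == 0)]    # winter gold AND no summer gold

print(either.index.tolist())
print(winter_only.index.tolist())
```
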

#### Reset index

In [None]:
df = df.reset_index()
df
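As a small illustration of what `reset_index` does, the sketch below uses a made-up two-row table. The old index labels are moved into a regular column (named `index` here because the original index has no name) and a fresh integer index takes their place.

```python
import pandas as pd

medals = pd.DataFrame({'Gold': [3, 0]}, index=['Aland', 'Borduria'])

flat = medals.reset_index()   # returns a new frame; medals is unchanged

print(flat.columns.tolist())
print(flat.index.tolist())
```
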

### Extended: Check multiple conditions in if statement
Here we look at how to check **multiple** conditions in a single if statement. This can be done by using `and`, `or`, or both in a single statement ([more information](https://www.geeksforgeeks.org/check-multiple-conditions-in-if-statement-python/)).

**PROGRAM 1**: program that grants access only to kids aged between 8-12

In [None]:
age = 18
  
if (age>= 8) and (age<= 12): 
    print("YOU ARE ALLOWED. WELCOME !") 
else: 
    print("SORRY ! YOU ARE NOT ALLOWED. BYE !") 

**PROGRAM 2**: program that checks the agreement of the user to the terms.

In [None]:
var = 'N'
  
if (var =='Y') or (var =='y'): 
    print("YOU SAID YES") 
elif(var =='N') or (var =='n'): 
    print("YOU SAID NO") 
else: 
    print("INVALID INPUT")
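The two programs can also be written as small functions, which makes it easy to try several inputs. This is only one possible sketch: the function names `is_allowed` and `parse_answer` are made up for illustration, and it uses a chained comparison and `str.lower()` in place of the explicit `and` / `or` shown above.

```python
def is_allowed(age):
    """Return True for ages from 8 to 12 inclusive."""
    # A chained comparison is equivalent to (age >= 8) and (age <= 12).
    return 8 <= age <= 12


def parse_answer(var):
    """Map 'Y'/'y' to 'YES', 'N'/'n' to 'NO', anything else to 'INVALID'."""
    answer = var.lower()   # lowercasing replaces (var == 'Y') or (var == 'y')
    if answer == 'y':
        return 'YES'
    elif answer == 'n':
        return 'NO'
    else:
        return 'INVALID'


print(is_allowed(18), is_allowed(10))
print(parse_answer('N'), parse_answer('y'), parse_answer('maybe'))
```
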

- The code in this notebook is modified from various sources. All code is for educational purposes only and released under the CC1.0. 