The pandas library was first developed by Wes McKinney in 2008 for data manipulation and analysis.

#### References:
    www.python.org
    www.numpy.org
    www.matplotlib.org
    https://pandas.pydata.org

#### Questions/feedback: petert@digipen.edu

# Chapter08: Pandas Dataframe
## pandas
   - DataFrame, Index
   - Data Manipulation
   - <font color="grey">Selection and Filtering</font>
   - <font color="grey">Descriptive Statistics</font>
   - <font color="grey">Read, Write and Load Data</font>

### Import pandas:
    using the alias 'pd' is the standard convention among pandas users
    importing the frequently used DataFrame and Series into the local namespace is good practice

In [1]:
import pandas as pd                     # 'pd' is the conventional alias for pandas
from pandas import DataFrame            # optional, good practice
from pandas import Series               # optional, good practice

import numpy as np
from matplotlib import pyplot as plt
#from matplotlib import colors
%matplotlib notebook

## DataFrame
    - rectangular data (table, spreadsheet), similar to an array of arrays
    - ordered set of columns
    - each column can have a different type: str, int, float, boolean, ...
    - column index and row index
    - can be interpreted as a dictionary of Series (using the same index)
##### Examples and basic functionality:
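
The "dictionary of Series" view from the list above can be sketched as follows. This is a minimal illustration; the names `year`, `course` and `df` and the three sample rows are made up for the example.

```python
import pandas as pd

# Two Series that share the same index labels (made-up sample data)
year = pd.Series([2019, 2020, 2021], index=['a', 'b', 'c'])
course = pd.Series(['CS232', 'CS373', 'CS376'], index=['a', 'b', 'c'])

# Each dictionary key becomes a column label; the shared index becomes the row index
df = pd.DataFrame({'year': year, 'courseID': course})

# Selecting a column gives back a Series with the same index
col = df['year']
print(type(col).__name__)
print(list(df.columns))
print(list(df.index))
```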

Create a Series, the one-dimensional building block of a dataframe, with custom index labels (note that index labels may repeat):

In [5]:
frame = pd.Series(["A", "B", "C", "D"], index=['a', 'c', 'a', 'b'])
frame

a    A
c    B
a    C
b    D
dtype: object
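
For comparison, a dataframe can also be built directly from a two-dimensional numpy array. The sketch below assumes a small made-up array; the names `arr` and `df_arr` and the row and column labels are chosen only for illustration.

```python
import numpy as np
import pandas as pd

# A 3 x 2 numpy array of integers (made-up values 0..5)
arr = np.arange(6).reshape(3, 2)

# Wrap the array in a DataFrame; index and columns are optional labels
df_arr = pd.DataFrame(arr, index=['r0', 'r1', 'r2'], columns=['x', 'y'])
print(df_arr)
print(df_arr.shape)
```

If `index` and `columns` are omitted, pandas falls back to the default labels 0, 1, 2, ...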

Create a dataframe using random numbers:

In [3]:
frame = pd.DataFrame(np.random.randn(24).reshape(4,6))
frame

|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | -1.947631 | 0.766271 | 0.336814 | 1.34733 | -2.578976 | -0.368 |
| 1 | 0.630119 | -0.327587 | 0.688915 | 1.205419 | -0.314027 | -0.668913 |
| 2 | 0.94171 | -0.437933 | 0.655168 | 2.349226 | 1.047685 | 1.864085 |
| 3 | -0.66177 | -0.64226 | 2.4867 | -0.277238 | -0.705061 | 0.364535 |


Create a dataframe using list of lists:

In [4]:
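# each inner list holds the values of one column
# (avoid the variable name 'list', which would shadow the Python built-in)
columns_data = [
    [2019, 2019, 2020, 2020, 2021, 2021, 2021],
    ['CS232', 'CS372', 'CS232', 'CS373', 'CS376', 'CS312', 'CS372'],
    ['Data Analytics', 'Machine Learning I', 'Data Analytics', 'Machine Learning II', 'Deep Learning', 'Big Data', 'Machine Learning I']
]
# DataFrame treats each inner list as a row, so transpose to get one column per list
frame = pd.DataFrame(columns_data).T
frame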

|   | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 2019 | CS232 | Data Analytics |
| 1 | 2019 | CS372 | Machine Learning I |
| 2 | 2020 | CS232 | Data Analytics |
| 3 | 2020 | CS373 | Machine Learning II |
| 4 | 2021 | CS376 | Deep Learning |
| 5 | 2021 | CS312 | Big Data |
| 6 | 2021 | CS372 | Machine Learning I |
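
pandas treats each inner list as a row, which is why the cell above transposes the result. One possible alternative, sketched here with made-up shortened lists, is to regroup the values with `zip` and name the columns directly. The names `years`, `ids`, `by_rows` and `by_cols` are illustrative only.

```python
import pandas as pd

years = [2019, 2020, 2021]
ids = ['CS232', 'CS373', 'CS376']

# Each inner list becomes one ROW: 2 rows x 3 columns
by_rows = pd.DataFrame([years, ids])

# zip regroups the values record by record: 3 rows x 2 named columns
by_cols = pd.DataFrame(list(zip(years, ids)), columns=['year', 'courseID'])

print(by_rows.shape)
print(by_cols.shape)
print(by_cols)
```

With this approach each column holds values of a single kind, so the 'year' column keeps an integer type instead of being mixed with strings.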


Create a dictionary of lists as a base for a dataframe:

In [5]:
data = {
    'year':       [2019, 2019, 2020, 2020, 2021, 2021, 2021],
    'courseID':   ['CS232', 'CS372', 'CS232', 'CS373', 'CS376', 'CS312', 'CS372'],
    'courseName': ['Data Analytics', 'Machine Learning I', 'Data Analytics', 'Machine Learning II', 'Deep Learning', 'Big Data', 'Machine Learning I']
}

Create dataframe using the prepared dictionary of lists:

In [6]:
# create dataframe
frame = pd.DataFrame(data)
print(frame)

   year courseID           courseName
0  2019    CS232       Data Analytics
1  2019    CS372   Machine Learning I
2  2020    CS232       Data Analytics
3  2020    CS373  Machine Learning II
4  2021    CS376        Deep Learning
5  2021    CS312             Big Data
6  2021    CS372   Machine Learning I


In [7]:
frame

|   | year | courseID | courseName |
|---|---|---|---|
| 0 | 2019 | CS232 | Data Analytics |
| 1 | 2019 | CS372 | Machine Learning I |
| 2 | 2020 | CS232 | Data Analytics |
| 3 | 2020 | CS373 | Machine Learning II |
| 4 | 2021 | CS376 | Deep Learning |
| 5 | 2021 | CS312 | Big Data |
| 6 | 2021 | CS372 | Machine Learning I |


The *head* and *tail* methods allow a peek at the data and its structure at the beginning and the end:

In [9]:
# peek at the first 5 rows
print(frame.head())
# peek at the last 3 rows
frame.tail(3)

   year courseID           courseName
0  2019    CS232       Data Analytics
1  2019    CS372   Machine Learning I
2  2020    CS232       Data Analytics
3  2020    CS373  Machine Learning II
4  2021    CS376        Deep Learning


|   | year | courseID | courseName |
|---|---|---|---|
| 4 | 2021 | CS376 | Deep Learning |
| 5 | 2021 | CS312 | Big Data |
| 6 | 2021 | CS372 | Machine Learning I |
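
Both methods take an optional row count, and the default is 5. A minimal sketch with a made-up ten-row dataframe (the names `df`, `first` and `last` are illustrative):

```python
import pandas as pd

# Ten rows with the values 0..9 in a single column 'n'
df = pd.DataFrame({'n': range(10)})

first = df.head()     # no argument: the first 5 rows
last = df.tail(3)     # the last 3 rows

print(list(first['n']))
print(list(last['n']))
```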


Note that print outputs the dataframe as plain text, whereas evaluating the dataframe as the last expression of a cell displays it as a formatted table.

Create another dataframe using 'data' and
- add another column
- specify row indices different from the default

In [10]:
# create another dataframe using 'data' and 
# the same column names but
#    add a new column
#    specify indices different than the default 0, 1, 2, ...
frame2 = pd.DataFrame(data, index=['a', 'b', 'c', 'd', 'e', 'f', 'g'], columns=['year', 'courseID', 'courseName', 'day'])
frame2

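|   | year | courseID | courseName | day |
|---|---|---|---|---|
| a | 2019 | CS232 | Data Analytics | NaN |
| b | 2019 | CS372 | Machine Learning I | NaN |
| c | 2020 | CS232 | Data Analytics | NaN |
| d | 2020 | CS373 | Machine Learning II | NaN |
| e | 2021 | CS376 | Deep Learning | NaN |
| f | 2021 | CS312 | Big Data | NaN |
| g | 2021 | CS372 | Machine Learning I | NaN |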
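
The new 'day' column has no data in the dictionary, so pandas fills it with missing values (NaN). A small sketch, using a made-up two-row dataframe named `small`, shows one way to check for them:

```python
import pandas as pd

# 'day' is listed in columns but has no entry in the data dictionary
small = pd.DataFrame({'year': [2019, 2020]},
                     index=['a', 'b'],
                     columns=['year', 'day'])

# isna() marks missing entries; all() checks that every entry is missing
print(small['day'].isna().all())
print(small['year'].isna().any())
```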


Examples of filtering and manipulating dataframes using column labels and row indices:

In [11]:
# retrieve a column using attribute of the dataframe
frame2.courseID

a    CS232
b    CS372
c    CS232
d    CS373
e    CS376
f    CS312
g    CS372
Name: courseID, dtype: object

In [12]:
# retrieve another column using attribute/property of the dataframe
frame2.year

a    2019
b    2019
c    2020
d    2020
e    2021
f    2021
g    2021
Name: year, dtype: int64

In [13]:
# retrieve a column using the column name of the dataframe
frame2['courseName']

a         Data Analytics
b     Machine Learning I
c         Data Analytics
d    Machine Learning II
e          Deep Learning
f               Big Data
g     Machine Learning I
Name: courseName, dtype: object

Looks familiar?

The result looks like a Pandas Series: index column, value column and type info

In [14]:
# Check the type:
type(frame2['courseName'])

pandas.core.series.Series
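
Attribute access and bracket access return the same Series, but the attribute form only works when the column label is a valid Python name that does not collide with an existing DataFrame attribute. The sketch below uses a made-up dataframe whose labels 'course name' and 'count' are chosen to show both limitations.

```python
import pandas as pd

df = pd.DataFrame({
    'courseID': ['CS232', 'CS372'],
    'course name': ['Data Analytics', 'Machine Learning I'],  # label contains a space
    'count': [30, 25],                                        # same name as DataFrame.count
})

# Both forms give the same column
same = df.courseID.equals(df['courseID'])
print(same)

# A label with a space can only be reached with brackets
names = df['course name']

# df.count is the DataFrame.count method, so the column needs brackets
counts = df['count']
print(callable(df.count))
print(list(counts))
```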

In [15]:
# Look at the previous dataframe again:
frame2

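|   | year | courseID | courseName | day |
|---|---|---|---|---|
| a | 2019 | CS232 | Data Analytics | NaN |
| b | 2019 | CS372 | Machine Learning I | NaN |
| c | 2020 | CS232 | Data Analytics | NaN |
| d | 2020 | CS373 | Machine Learning II | NaN |
| e | 2021 | CS376 | Deep Learning | NaN |
| f | 2021 | CS312 | Big Data | NaN |
| g | 2021 | CS372 | Machine Learning I | NaN |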


In [16]:
# set 'day' for specific rows: the Series is matched by index label, rows without a match stay NaN
dayval = pd.Series(['Monday', 'Tuesday', 'Wednesday', 'Monday'], index=['a', 'f', 'c', 'd'])
frame2.day = dayval
frame2

|   | year | courseID | courseName | day |
|---|---|---|---|---|
| a | 2019 | CS232 | Data Analytics | Monday |
| b | 2019 | CS372 | Machine Learning I | NaN |
| c | 2020 | CS232 | Data Analytics | Wednesday |
| d | 2020 | CS373 | Machine Learning II | Monday |
| e | 2021 | CS376 | Deep Learning | NaN |
| f | 2021 | CS312 | Big Data | Tuesday |
| g | 2021 | CS372 | Machine Learning I | NaN |


In [None]:
# assign a Series whose index labels ('x', 'y') do not occur in frame2: every value in 'day' becomes NaN
dayval = pd.Series(['Sunday', 'Saturday'], index=['x', 'y'])
frame2.day = dayval
frame2

In [None]:
# assign a Series with the default index (0, 1): no label matches 'a'..'g', so 'day' again becomes all NaN
dayval = pd.Series(['Sunday', 'Saturday'])
frame2.day = dayval
frame2
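
The three assignments above all follow the same rule: when a Series is assigned to a column, pandas matches values by index label, not by position. A compact sketch with a made-up three-row dataframe (the columns 'room' and 'slot' are hypothetical, added only for the demonstration):

```python
import pandas as pd

df = pd.DataFrame({'courseID': ['CS232', 'CS372', 'CS373']}, index=['a', 'b', 'c'])

# Labels 'c' and 'a' exist in df: values land on those rows, 'b' gets NaN
df['day'] = pd.Series(['Monday', 'Friday'], index=['c', 'a'])

# Labels 'x' and 'y' do not exist in df: the whole column is NaN
df['room'] = pd.Series(['R1', 'R2'], index=['x', 'y'])

# A plain list has no labels: it is assigned by position and must match the row count
df['slot'] = ['am', 'pm', 'am']

print(df)
```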

#### Transposition of a dataframe is similar to numpy arrays:

In [None]:
frame2.T

In [None]:
# The original dataframe has not changed:
frame2

Examples of modifying DataFrame elements in bulk:

In [None]:
# modify all values of a column at once
frame3 = frame2                 # no copy is made: frame3 and frame2 are two names for the same dataframe
frame3.day = 'Tuesday'
# or:
frame3['day'] = 'Wednesday'
frame3
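
Plain assignment of a dataframe to a new name does not copy it, so changes made through either name are visible through both. The sketch below contrasts this with `copy()`; the names `original`, `alias` and `independent` are made up for the example.

```python
import pandas as pd

original = pd.DataFrame({'day': ['Monday', 'Friday']})

alias = original                 # second name for the same object
independent = original.copy()    # separate dataframe with its own data

alias['day'] = 'Tuesday'         # also changes what 'original' shows
independent['day'] = 'Sunday'    # leaves 'original' untouched

print(alias is original)
print(list(original['day']))
print(list(independent['day']))
```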

Delete a column:

In [None]:
# delete a column
del frame3['day']
frame3

Display index information:

In [None]:
frame3.index

Display values of a dataframe:

In [None]:
frame3.values

In [None]:
type(frame3.values)
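
The `values` attribute returns a numpy array. The pandas documentation recommends the `to_numpy()` method for the same purpose; a minimal sketch with a made-up two-column dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [2019, 2020], 'courseID': ['CS232', 'CS373']})

# Whole dataframe: mixed column types generally end up in one array of Python objects
arr = df.to_numpy()
print(type(arr).__name__)
print(arr.shape)

# Numeric columns only: the array keeps a numeric dtype
nums = df[['year']].to_numpy()
print(nums.dtype.kind)
```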

##### Dropping rows or columns:

In [None]:
# drop rows based on indices
frame3.drop(['b', 'c', 'd'])

In [None]:
# drop column(s) based on column names and specifying axis=1
frame3.drop('courseName', axis=1)

Note that the result is displayed without an explicit call to display or print: drop returns a new dataframe, and the notebook shows the value of the last expression in a cell.

The dataframe has not changed:

In [None]:
# the dataframe has not changed
frame3

To keep the result of a drop, either assign it to a variable or pass "inplace=True" so that the dataframe itself is modified:

In [None]:
print('dataframe frame3:')
print(frame3)
frame4 = frame3                 # same object as frame3, so the in-place drops below also change frame3
frame4.drop(['b', 'c', 'd'], inplace=True)
frame4.drop('courseName', axis=1, inplace=True)
print('\ndataframe frame4:')
print(frame4)
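
The other option mentioned above, assigning the result of drop, can be sketched as follows. The example also uses the `index=` and `columns=` keywords of drop, which are an alternative to passing labels together with `axis`. The dataframe and the names `fewer_rows`, `fewer_cols` and `trimmed` are made up.

```python
import pandas as pd

df = pd.DataFrame({'year': [2019, 2020, 2021],
                   'courseID': ['CS232', 'CS373', 'CS376']},
                  index=['a', 'b', 'c'])

fewer_rows = df.drop(index=['b'])        # equivalent to df.drop(['b'])
fewer_cols = df.drop(columns=['year'])   # equivalent to df.drop('year', axis=1)

# df itself is unchanged because neither call used inplace=True
print(df.shape)

# Chaining both drops and keeping the result under a new name
trimmed = df.drop(index=['a']).drop(columns=['year'])
print(trimmed)
```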

#### Homework 8.1:
Create a dataframe and perform the tasks below:
- create a 4 x 2 dataframe (4 rows and 2 columns)
- the column labels should be "class" and "midterm"
- row indices should be "first", "second", "third" and "fourth"
- the values should be the names of 4 of your current (or made-up) classes and the corresponding expected midterm grades
- add a new column with label "final"
- add expected final grade values to "second" and "fourth" (rows/index labels)
- drop one class (it cannot be CS397!)
- display the dataframe after each change 

In [None]:
# Homework 8.1 code goes here:

