# Pandas
Pandas is a fast, powerful, flexible, and easy-to-use open source data analysis and manipulation tool built on top of the Python programming language. It lets us analyze large data sets and draw conclusions from them using statistical methods. Pandas can clean messy data sets and make them readable and relevant. It also integrates well with the matplotlib library, which makes it a very handy tool for analyzing data.

## DataFrame
DataFrame is the most widely used data structure in Pandas. Note that a Series holds one-dimensional data, whereas a DataFrame holds two-dimensional data. A DataFrame has two indexes: a row index and a column index.

A common way to create a DataFrame is from a dictionary of equal-length lists or, as in the example below, from a dictionary of Series. Further, spreadsheets and text files are read as DataFrames, so it is a very important data structure in Pandas.
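
For the list-based case, a minimal sketch might look like the following. The variable names and the reuse of the weight and birth year values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Illustrative data: each key becomes a column, each list holds that column's values.
data = {
    "weight": [68, 83, 112],
    "birthyear": [1985, 1984, 1992],
}

# Lists carry no labels, so the row labels are passed separately via `index`.
people_from_lists = pd.DataFrame(data, index=["alice", "bob", "charles"])
print(people_from_lists)
```

Because lists have no index of their own, all of them must have the same length; pandas cannot fill gaps the way it does when aligning Series.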

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Creating a DataFrame from a dictionary of Series
peopleDict = {
    "weight": pd.Series([68, 83, 112], index=["alice", "bob", "charles"]),
    "birthyear": pd.Series([1984, 1985, 1992], index=["bob", "alice", "charles"], name="year"),
    "children": pd.Series([0, 3], index=["charles", "bob"]),
    "hobby": pd.Series(["Biking", "Dancing"], index=["alice", "bob"]),
}
people = pd.DataFrame(peopleDict)
people

,weight,birthyear,children,hobby
alice,68,1985,,Biking
bob,83,1984,3.0,Dancing
charles,112,1992,0.0,


A few things to note:

* the Series were automatically aligned based on their index,
* missing values are represented as NaN,
* Series names are ignored (the name "year" was dropped),
* DataFrames are displayed nicely in Jupyter notebooks
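
The alignment and NaN behavior noted above can be checked directly. The sketch below rebuilds only two of the columns (the name `df` is an assumption made for illustration) and shows a side effect worth knowing: an integer column that receives NaN during alignment is stored as floating point.

```python
import pandas as pd

weight = pd.Series([68, 83, 112], index=["alice", "bob", "charles"])
children = pd.Series([0, 3], index=["charles", "bob"])  # no entry for alice

df = pd.DataFrame({"weight": weight, "children": children})

# "weight" stays integer; "children" becomes float64 because NaN is a float.
print(df.dtypes)

# isna() marks the rows that had no value after alignment.
print(df["children"].isna())
```

This is why the `children` column is displayed as `3.0` and `0.0` rather than `3` and `0`.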

## Adding and removing columns
You can generally treat DataFrame objects like dictionaries of Series, so the following work fine:

In [3]:
people["age"] = 2021 - people["birthyear"]  # adds a new column "age"
people["over 30"] = people["age"] > 30      # adds another column "over 30"
del people["children"]                      # deletes the column "children"

people

,weight,birthyear,hobby,age,over 30
alice,68,1985,Biking,36,True
bob,83,1984,Dancing,37,True
charles,112,1992,,29,False


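When you add a new column as a Series, it is aligned with the DataFrame's index: missing rows are filled with NaN, and extra rows are ignored. (A plain list or array, by contrast, must have exactly as many elements as the DataFrame has rows.)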

In [4]:
people["pets"] = pd.Series({"bob": 0, "charles": 5, "eugene":1})  # alice is missing, eugene is ignored
people

,weight,birthyear,hobby,age,over 30,pets
alice,68,1985,Biking,36,True,
bob,83,1984,Dancing,37,True,0.0
charles,112,1992,,29,False,5.0
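
To contrast the two cases, here is a small sketch on a cut-down frame. The names `df` and `length_error` and the column `"bad"` are hypothetical. A Series is aligned on its index, while a plain list of the wrong length is rejected.

```python
import pandas as pd

df = pd.DataFrame({"weight": [68, 83, 112]}, index=["alice", "bob", "charles"])

# A Series is aligned on the index: "alice" gets NaN, "eugene" is dropped.
df["pets"] = pd.Series({"bob": 0, "charles": 5, "eugene": 1})

# A plain list has no index to align on, so its length must match the row count.
try:
    df["bad"] = [1, 2]  # only 2 values for 3 rows
    length_error = None
except ValueError as exc:
    length_error = exc

print(df)
print(length_error)
```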


A new column is added at the end (on the right) by default. You can also insert a column anywhere else using the `insert()` method:

In [5]:
# Adding a "Qualification" column from a list at position 4 (positions are zero-based)
people.insert(4, "Qualification", ["Msc", "MA", "Phd"])
people

,weight,birthyear,hobby,age,Qualification,over 30,pets
alice,68,1985,Biking,36,Msc,True,
bob,83,1984,Dancing,37,MA,True,0.0
charles,112,1992,,29,Phd,False,5.0


In [6]:
# Adding a "height" column from a list at position 1 (positions are zero-based)
people.insert(1, "height", [172, 181, 185])
people

,weight,height,birthyear,hobby,age,Qualification,over 30,pets
alice,68,172,1985,Biking,36,Msc,True,
bob,83,181,1984,Dancing,37,MA,True,0.0
charles,112,185,1992,,29,Phd,False,5.0
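
One detail of `insert()` worth knowing: it modifies the DataFrame in place, and by default it refuses to insert a column whose name already exists. A minimal sketch on a cut-down frame (the names `df` and `duplicate_error` are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({"weight": [68, 83, 112]}, index=["alice", "bob", "charles"])

# Position 0 places the new column before all existing columns.
df.insert(0, "height", [172, 181, 185])

# Inserting a second column named "height" raises ValueError by default.
try:
    df.insert(1, "height", [0, 0, 0])
    duplicate_error = None
except ValueError as exc:
    duplicate_error = exc

print(list(df.columns))
print(duplicate_error)
```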


In [7]:
# Adding a computed column, e.g., BMI = weight / height^2 (height converted from cm to m)
people["BMI"] = people["weight"] / (people["height"] / 100) ** 2
people

,weight,height,birthyear,hobby,age,Qualification,over 30,pets,BMI
alice,68,172,1985,Biking,36,Msc,True,,22.985398
bob,83,181,1984,Dancing,37,MA,True,0.0,25.335002
charles,112,185,1992,,29,Phd,False,5.0,32.724617


You can also create new columns by calling the `assign()` method. Note that this returns a new DataFrame object; the original is not modified:

In [8]:
peoplenew = people.assign(overweight=people["BMI"] > 25)
peoplenew

,weight,height,birthyear,hobby,age,Qualification,over 30,pets,BMI,overweight
alice,68,172,1985,Biking,36,Msc,True,,22.985398,False
bob,83,181,1984,Dancing,37,MA,True,0.0,25.335002,True
charles,112,185,1992,,29,Phd,False,5.0,32.724617,True


In [9]:
people

,weight,height,birthyear,hobby,age,Qualification,over 30,pets,BMI
alice,68,172,1985,Biking,36,Msc,True,,22.985398
bob,83,181,1984,Dancing,37,MA,True,0.0,25.335002
charles,112,185,1992,,29,Phd,False,5.0,32.724617
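
`assign()` also accepts callables, which lets one call create several columns where a later one depends on an earlier one. The sketch below is one possible way to do this on a cut-down frame; the names `df` and `result` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"weight": [68, 83, 112], "height": [172, 181, 185]},
    index=["alice", "bob", "charles"],
)

# Each lambda receives the DataFrame as built so far, so "overweight"
# can refer to the "BMI" column created just before it in the same call.
result = df.assign(
    BMI=lambda d: d["weight"] / (d["height"] / 100) ** 2,
    overweight=lambda d: d["BMI"] > 25,
)

print(result)
print(list(df.columns))  # the original frame still has only two columns
```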


You can also pass an assignment expression to the `eval()` method and set `inplace=True` to modify the DataFrame directly:

In [10]:
people.eval("overweight = BMI > 25", inplace=True)
people

,weight,height,birthyear,hobby,age,Qualification,over 30,pets,BMI,overweight
alice,68,172,1985,Biking,36,Msc,True,,22.985398,False
bob,83,181,1984,Dancing,37,MA,True,0.0,25.335002,True
charles,112,185,1992,,29,Phd,False,5.0,32.724617,True
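
For comparison, when `inplace` is left at its default, `eval()` behaves like `assign()` and returns a new DataFrame. A minimal sketch, using rounded BMI values and hypothetical names `df` and `with_flag`:

```python
import pandas as pd

df = pd.DataFrame(
    {"BMI": [22.98, 25.33, 32.72]},
    index=["alice", "bob", "charles"],
)

# Without inplace=True the original is untouched and a new frame is returned.
with_flag = df.eval("overweight = BMI > 25")

print(with_flag)
print("overweight" in df.columns)
```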


In [11]:
# Removing a column
del people["over 30"]
people

,weight,height,birthyear,hobby,age,Qualification,pets,BMI,overweight
alice,68,172,1985,Biking,36,Msc,,22.985398,False
bob,83,181,1984,Dancing,37,MA,0.0,25.335002,True
charles,112,185,1992,,29,Phd,5.0,32.724617,True
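
`del` modifies the DataFrame in place. If you would rather keep the original intact, one alternative is the `drop()` method, which returns a new DataFrame. The sketch below uses a cut-down frame, and the names `df` and `trimmed` are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {"weight": [68, 83, 112], "age": [36, 37, 29], "over 30": [True, True, False]},
    index=["alice", "bob", "charles"],
)

# drop() returns a copy without the listed columns; df itself is unchanged.
trimmed = df.drop(columns=["over 30"])

print(list(trimmed.columns))
print(list(df.columns))
```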


In [12]:
# We can also pop a column: pop() removes it from the DataFrame and returns it as a Series
peoplePets = people.pop('pets')
print(peoplePets)

alice      NaN
bob        0.0
charles    5.0
Name: pets, dtype: float64
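
To confirm that `pop()` really removes the column, the following sketch checks the frame afterwards and shows that popping the same column a second time fails. The names `df`, `pets`, and `second_pop_error` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {"weight": [68, 83, 112], "pets": [None, 0, 5]},
    index=["alice", "bob", "charles"],
)

pets = df.pop("pets")  # returns the column as a Series and removes it from df
print(list(df.columns))

# The column no longer exists, so a second pop raises KeyError.
try:
    df.pop("pets")
    second_pop_error = None
except KeyError as exc:
    second_pop_error = exc

print(type(second_pop_error).__name__)
```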
