# Introduction to Pandas

## What is Pandas?

https://pandas.pydata.org/

- An open-source data analysis library for Python
- Built-in functions to read and write tabular data in many file formats (.xlsx, .csv, .hdf, etc.)
- Works well with matplotlib and other Python libraries
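
As a rough sketch of the read/write functions mentioned above, the cell below round-trips a small CSV through `pd.read_csv` and `DataFrame.to_csv`. The data and the names `csv_text`, `scores` and `buffer` are made up for illustration, and an in-memory buffer stands in for a real file so the cell runs anywhere.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a real file on disk.
csv_text = "name,score\nada,90\ngrace,85\n"

# read_csv accepts a file path or any file-like object.
scores = pd.read_csv(io.StringIO(csv_text))

# to_csv writes to a path or a buffer; index=False leaves out the row labels.
buffer = io.StringIO()
scores.to_csv(buffer, index=False)
round_trip = buffer.getvalue()
```

With a real file, a path such as `"scores.csv"` would take the place of the buffer in both calls.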

## Pros for the new user
- Well documented, both officially (pandas.pydata.org) and unofficially (zillions of blogs)
- Large, active user community on YouTube and Stack Exchange

## Cons for the new user
- The syntax can be very verbose
- Collaborators who are not familiar with Pandas may have trouble reading your code



In [72]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Pandas dataframe

Dataframes are the primary data object in the Pandas library.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html

- Two dimensional tabular data
- Size mutable (the 2D shape can change, i.e. the user can add or remove rows and columns)
- Columns do not have to share a data type (but keeping each column to a single type is good practice)
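
The properties listed above can be checked directly. This is a minimal sketch with made-up data (`inventory` and its columns are hypothetical names, not part of this notebook): it adds a column, drops one, and inspects the per-column types.

```python
import pandas as pd

# Hypothetical example data: one text column and one integer column.
inventory = pd.DataFrame({"item": ["bolt", "nut"], "count": [10, 25]})
shape_before = inventory.shape  # (rows, columns)

# Assigning to a new column name grows the dataframe in place.
inventory["price"] = [0.25, 0.10]
shape_after_add = inventory.shape

# drop returns a new dataframe without the named column.
trimmed = inventory.drop(columns=["count"])

# Each column carries its own dtype.
column_dtypes = inventory.dtypes
```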

In [73]:
df = pd.DataFrame([1,2,3])
df

   0
0  1
1  2
2  3


You can think of a Pandas dataframe as a spreadsheet: a grid of values with labeled rows and columns.

In [74]:
a = [1,2,3,4,5]
b = [4,5,7,3,2]
df = pd.DataFrame(
    [a,b],
    index=["first", "second"],
    )
df

        0  1  2  3  4
first   1  2  3  4  5
second  4  5  7  3  2


In [75]:
df = df.transpose()
df

   first  second
0      1       4
1      2       5
2      3       7
3      4       3
4      5       2
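
A list of lists is read row by row, which is why the transpose was needed to turn `a` and `b` into columns. One possible shortcut, sketched here, is to pass a dict so that each key becomes a column label directly. The names `df_from_dict` and `df_transposed` are only for this illustration.

```python
import pandas as pd

a = [1, 2, 3, 4, 5]
b = [4, 5, 7, 3, 2]

# Each dict key becomes a column label and each list becomes that column.
df_from_dict = pd.DataFrame({"first": a, "second": b})

# The same table built from rows and then transposed.
df_transposed = pd.DataFrame([a, b], index=["first", "second"]).transpose()
```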


In [76]:
df.sort_values(by="second")

   first  second
4      5       2
3      4       3
0      1       4
1      2       5
2      3       7
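
Note that `sort_values` returns a new dataframe and leaves `df` itself in its original order. The sketch below rebuilds the same two columns and shows that, along with two common variations; the names `by_second`, `descending` and `renumbered` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"first": [1, 2, 3, 4, 5], "second": [4, 5, 7, 3, 2]})

# sort_values returns a new dataframe; df keeps its original row order.
by_second = df.sort_values(by="second")

# ascending=False puts the largest values first.
descending = df.sort_values(by="second", ascending=False)

# reset_index(drop=True) discards the old row labels and renumbers from 0.
renumbered = by_second.reset_index(drop=True)
```

To keep a sorted result, assign it back to a name, as in `df = df.sort_values(by="second")`.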


In [77]:
df["third"] = [3,3,3,3,3]
df

   first  second  third
0      1       4      3
1      2       5      3
2      3       7      3
3      4       3      3
4      5       2      3


In [78]:
df["gibberish"] = ["a", "4",4,np.nan,3]
df

   first  second  third  gibberish
0      1       4      3          a
1      2       5      3          4
2      3       7      3          4
3      4       3      3        NaN
4      5       2      3          3
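
A column that mixes strings and numbers like this is stored with the generic `object` dtype, which is why keeping columns homogeneous is good practice. One way to clean such a column, sketched here on a standalone Series with the same values, is `pd.to_numeric` with `errors="coerce"`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

gibberish = pd.Series(["a", "4", 4, np.nan, 3], name="gibberish")

# Mixed strings and numbers are stored as generic Python objects.
original_dtype = gibberish.dtype

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_numbers = pd.to_numeric(gibberish, errors="coerce")
```

Here the string `"4"` is parsed as a number, while `"a"` becomes NaN alongside the value that was already missing.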


## Accessing data within a dataframe

In [79]:
df

   first  second  third  gibberish
0      1       4      3          a
1      2       5      3          4
2      3       7      3          4
3      4       3      3        NaN
4      5       2      3          3


In [80]:
df.loc[0,"first"]

1

In [81]:
df.loc[4, "gibberish"]

3

In [82]:
df.gibberish

0      a
1      4
2      4
3    NaN
4      3
Name: gibberish, dtype: object

In [83]:
df["gibberish"]

0      a
1      4
2      4
3    NaN
4      3
Name: gibberish, dtype: object
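
The two cells above return the same column, but the attribute form only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute. A small sketch, using a hypothetical column called `size` to show the clash:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "second": [4, 5, 7, 3, 2],
        "all fours": [4, 4, 4, 4, 4],
        "size": [1, 1, 2, 2, 3],  # hypothetical column, clashes with DataFrame.size
    }
)

# For a simple name, both spellings give the same column.
same_column = df.second.equals(df["second"])

# A name containing a space can only be reached with brackets.
fours = df["all fours"]

# df.size is the built-in cell count, not the "size" column.
cell_count = df.size
size_column = df["size"]
```

Bracket access is the safer default when column names come from outside data.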

In [84]:
df.loc[0:3, "first"]

0    1
1    2
2    3
3    4
Name: first, dtype: int64

In [85]:
df.loc[2,:]

first        3
second       7
third        3
gibberish    4
Name: 2, dtype: object
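
`loc` selects by label, and a label slice such as `0:3` includes its end point, which is why the cell above returned four rows. The sketch below contrasts that with `iloc`, the position-based accessor, which is not used elsewhere in this notebook. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"first": [1, 2, 3, 4, 5], "second": [4, 5, 7, 3, 2]})
shuffled = df.sort_values(by="second")  # row labels are now 4, 3, 0, 1, 2

# loc looks up labels, and a label slice includes its end point.
label_slice = df.loc[0:3, "first"]  # labels 0, 1, 2, 3

# iloc looks up positions, and a position slice excludes its end point.
position_slice = df.iloc[0:3, 0]  # positions 0, 1, 2

# After sorting, label 0 and position 0 refer to different rows.
by_label = shuffled.loc[0, "second"]
by_position = shuffled.iloc[0, 1]
```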

## Assigning values to the dataframe

In [86]:
# new column:
df["all fours"] = 4
df["more fours"] = [4,4,4,4,4]
df

   first  second  third  gibberish  all fours  more fours
0      1       4      3          a          4           4
1      2       5      3          4          4           4
2      3       7      3          4          4           4
3      4       3      3        NaN          4           4
4      5       2      3          3          4           4


In [87]:
df.loc[4,"gibberish"] = "Howdy"
df.loc[0,"first"] = 25
df

   first  second  third  gibberish  all fours  more fours
0     25       4      3          a          4           4
1      2       5      3          4          4           4
2      3       7      3          4          4           4
3      4       3      3        NaN          4           4
4      5       2      3      Howdy          4           4
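
The assignment rules used above can be summarized in a short sketch: a scalar is broadcast to every row, a list must match the number of rows, and `loc` can also take a boolean mask to assign to several rows at once. The mask example and the column name `"too short"` are additions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"first": [1, 2, 3, 4, 5], "second": [4, 5, 7, 3, 2]})

# A scalar is broadcast to every row.
df["all fours"] = 4

# A list must have exactly one value per row, otherwise pandas raises ValueError.
try:
    df["too short"] = [4, 4, 4]
    length_error = None
except ValueError as err:
    length_error = err

# A boolean mask inside loc assigns only to the matching rows.
df.loc[df["second"] > 4, "first"] = 0
```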


In [88]:
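# numeric_only=True skips the mixed-type "gibberish" column; pandas 2.x raises a TypeError without it
df.max(numeric_only=True)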

first         25
second         7
third          3
all fours      4
more fours     4
dtype: int64

In [92]:
df.second.max()

7

In [93]:
df.second.mean()

4.2

In [95]:
df.second*3

0    12
1    15
2    21
3     9
4     6
Name: second, dtype: int64
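
The last few cells apply an operation to a whole column at once. As a sketch of the same idea taken slightly further, the cell below combines two columns, aggregates across rows instead of down columns, and asks for summary statistics. The column name `total` and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"first": [25, 2, 3, 4, 5], "second": [4, 5, 7, 3, 2]})

# Arithmetic between columns is element-wise, with rows matched by label.
df["total"] = df["first"] + df["second"]

# axis=1 aggregates across the columns of each row.
row_max = df[["first", "second"]].max(axis=1)

# describe() reports count, mean, std, min, quartiles and max for each column.
summary = df.describe()
```

Individual statistics can then be read back with `loc`, for example `summary.loc["mean", "second"]`.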