# Pandas

pandas is Python's way to analyze data in spreadsheet-like objects called *DataFrames*.
It comes with **built-in data visualization**: plotting methods that wrap Matplotlib.

Install it: `conda install pandas` (or `pip install pandas`)

## Series

A **Series** is a single column of data. Think of it as a one-dimensional array whose entries carry labels.

In [1]:
import pandas as pd

data = [2, 4, 8]
ser = pd.Series(data)
ser

0    2
1    4
2    8
dtype: int64

Your data are indexed by **labels**. By default these are just the integer positions (0, 1, 2, ...).
You can add your own labels:

In [2]:
labels = ['a', 'b', 'c']
ser = pd.Series(data, index=labels)
ser

a    2
b    4
c    8
dtype: int64

In [3]:
# Here's a slick way to make a labeled Series: pass a dict.
# NOTE: The keys become the labels. They don't have to be strings; any hashable value works.
d = {'low':10, 'med':20, 'high':30}
ser = pd.Series(d)
ser

low     10
med     20
high    30
dtype: int64

In [4]:
# Access a Series by label, like a dict.
ser['low']

10
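
To make label-based access a bit more concrete, here is a minimal sketch. The data and the names `by_year`, `by_label`, `by_position`, and `subset` are made up for illustration; they are not from the notebook above. It uses integer keys to show that labels are not limited to strings, and it contrasts looking up by label (`.loc`) with looking up by position (`.iloc`).

```python
import pandas as pd

# Hypothetical data: integer years as dict keys (labels need not be strings).
by_year = pd.Series({2019: 1.5, 2020: 2.5, 2021: 4.0})

by_label = by_year.loc[2020]         # look up by label
by_position = by_year.iloc[0]        # look up by position (first entry)
subset = by_year.loc[[2019, 2021]]   # several labels at once gives a Series
```

Using `.loc` and `.iloc` explicitly avoids any ambiguity about whether an integer means a label or a position.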

We can add a Series to a DataFrame as a new column, but the behavior can be surprising: values are matched up by label, not by position.
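
One way to see what happens when a Series is attached to a DataFrame is the sketch below. The frame, the column names `x` and `y`, and the Series `extra` are all hypothetical. It assumes the Series is assigned as a new column, in which case pandas lines values up by index label and fills in `NaN` where a label is missing.

```python
import pandas as pd

# Hypothetical frame and Series, chosen so the labels don't line up by position.
df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])
extra = pd.Series([30, 10], index=['c', 'a'])  # different order, and 'b' is missing

df['y'] = extra  # values are matched by label, not by position

y_at_a = df.loc['a', 'y']  # 10.0
y_at_b = df.loc['b', 'y']  # NaN, because 'extra' has no label 'b'
y_at_c = df.loc['c', 'y']  # 30.0
```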

## DataFrames

A **DataFrame** is a table of data with labeled rows and labeled columns.

In [5]:
from numpy.random import randn

df = pd.DataFrame(randn(3, 3), index=['first', 'second', 'third'], columns=['you', 'me', 'them'])
df

             you        me      them
first  -0.106658 -1.314217  0.512528
second  0.459193  0.032293  0.354861
third   1.107927 -0.189876  0.208347


In [6]:
# Print the table without the row labels (the index):
print(df.to_string(index=False))

      you        me      them
-0.106658 -1.314217  0.512528
 0.459193  0.032293  0.354861
 1.107927 -0.189876  0.208347


Each **column** is a Series.
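
A quick way to check that claim, sketched with the same kind of frame as above (the names `col`, `is_series`, `same_labels`, and `col_name` are just for illustration):

```python
import pandas as pd
from numpy.random import randn

df = pd.DataFrame(randn(3, 3), index=['first', 'second', 'third'],
                  columns=['you', 'me', 'them'])

col = df['you']                            # pull out one column
is_series = isinstance(col, pd.Series)     # True: a column is a Series
same_labels = col.index.equals(df.index)   # it keeps the row labels of the frame
col_name = col.name                        # the column label becomes its name
```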

In [6]:
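# Reset the index: the old row labels move into a new 'index' column.
# (The numbers differ from the table above, presumably because the random data were regenerated.)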
df.reset_index(inplace=True)
df

    index       you        me      them
0   first -0.437520 -1.689497 -1.576236
1  second  1.196038  2.349520  0.432266
2   third  1.345019  2.440266 -0.809112
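
As a small sketch of the options here (with a made-up two-row frame): without `inplace=True`, `reset_index` returns a new DataFrame and leaves the original alone, and `drop=True` throws the old labels away instead of keeping them as a column.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({'you': [1, 2]}, index=['first', 'second'])

kept = df.reset_index()              # old labels become a column named 'index'
dropped = df.reset_index(drop=True)  # old labels are discarded

# df itself is untouched because inplace=True was not used.
```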


In [7]:
# Label your columns.
df = pd.DataFrame(randn(4, 3), columns=['USA', 'Canada', 'Mexico'])
df

        USA    Canada    Mexico
0 -0.359947  0.312600 -0.386876
1  0.207077 -1.267222 -1.266063
2  1.176440  0.386349  0.527205
3 -1.216895 -0.740020 -0.915343


In [8]:
# Slice a df to get only certain rows:
df[2:]

        USA    Canada    Mexico
2  1.176440  0.386349  0.527205
3 -1.216895 -0.740020 -0.915343
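
Row slicing with plain brackets goes by position. The sketch below uses a hypothetical frame with string row labels to make that visible, and shows the explicit positional and label-based spellings next to it.

```python
import pandas as pd

# Hypothetical frame; string labels make position vs. label easy to tell apart.
df = pd.DataFrame({'USA': [1, 2, 3, 4]}, index=['a', 'b', 'c', 'd'])

by_position = df[2:]          # rows at positions 2 and 3
same_thing = df.iloc[2:]      # explicit positional version
by_label = df.loc['b':'c']    # label slice; the end label 'c' is included
```

Note that a label slice with `.loc` includes its endpoint, unlike a positional slice.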


In [9]:
# Grab one column (a Series) or several at once (a DataFrame).
# Only the last expression in a cell is displayed.
df['USA']
df[['USA','Mexico']]

        USA    Mexico
0 -0.359947 -0.386876
1  0.207077 -1.266063
2  1.176440  0.527205
3 -1.216895 -0.915343


In [10]:
# Grab a row:
print(df.loc[2])  # Can accept str labels too.
print()
# Also: 
print(df.iloc[3])  # Stands for integer location. Takes positions (ints, slices, lists of ints), not labels.

USA       1.176440
Canada    0.386349
Mexico    0.527205
Name: 2, dtype: float64

USA      -1.216895
Canada   -0.740020
Mexico   -0.915343
Name: 3, dtype: float64
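
With the default 0, 1, 2, ... index, `loc` and `iloc` look interchangeable. A minimal sketch with a hypothetical non-default integer index shows where they part ways:

```python
import pandas as pd

# Hypothetical frame whose row labels are 10, 20, 30 rather than 0, 1, 2.
df = pd.DataFrame({'USA': [1.0, 2.0, 3.0]}, index=[10, 20, 30])

by_label = df.loc[20, 'USA']    # row labeled 20
by_position = df.iloc[2, 0]     # third row, first column

# df.loc[2] would raise a KeyError here: there is no row labeled 2.
```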


In [11]:
# Get an element:
df.loc[2, 'USA']

1.176440133934386

In [12]:
# Add a new column, here computed from two existing ones:
df['England'] = df['USA'] + df['Canada']
df

        USA    Canada    Mexico   England
0 -0.359947  0.312600 -0.386876 -0.047347
1  0.207077 -1.267222 -1.266063 -1.060145
2  1.176440  0.386349  0.527205  1.562789
3 -1.216895 -0.740020 -0.915343 -1.956914


In [13]:
# Remove a column (axis=1 means columns; axis=0 means rows):
skimmed_df = df.drop('USA', axis=1, inplace=False)  # inplace=True would modify df itself and return None.
skimmed_df

     Canada    Mexico   England
0  0.312600 -0.386876 -0.047347
1 -1.267222 -1.266063 -1.060145
2  0.386349  0.527205  1.562789
3 -0.740020 -0.915343 -1.956914
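
`drop` also accepts the keywords `columns=` and `index=`, which some people find easier to read than `axis`. A short sketch with a made-up frame:

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({'USA': [1, 2], 'Canada': [3, 4], 'Mexico': [5, 6]})

no_usa = df.drop(columns='USA')    # same as df.drop('USA', axis=1)
no_first_row = df.drop(index=0)    # same as df.drop(0, axis=0)

# Both calls return new DataFrames; df still has all its rows and columns.
```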


## Conditional Selection (very important!)

In [14]:
# Create a "mask" of boolean values:
mask = df > 0
mask

     USA  Canada  Mexico  England
0  False    True   False    False
1   True   False   False    False
2   True    True    True     True
3  False   False   False    False


In [15]:
# Apply the mask to the df. Entries where the mask is False become NaN:
df[mask]

        USA    Canada    Mexico   England
0       NaN  0.312600       NaN       NaN
1  0.207077       NaN       NaN       NaN
2  1.176440  0.386349  0.527205  1.562789
3       NaN       NaN       NaN       NaN
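
Two things that follow from a DataFrame-shaped mask, sketched on a small hypothetical frame: indexing with the mask does the same job as `df.where(mask)`, and since `True` counts as 1, summing the mask counts how many entries pass the test.

```python
import pandas as pd

# Hypothetical frame with a mix of positive and negative values.
df = pd.DataFrame({'USA': [-1.0, 2.0], 'Canada': [3.0, -4.0]})

mask = df > 0
masked = df[mask]                         # False entries become NaN
same = masked.equals(df.where(mask))      # df.where(mask) gives the same result
positives_per_column = mask.sum()         # count of True values in each column
total_positives = int(mask.sum().sum())   # count over the whole table
```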


In [16]:
# Make a mask from a Series:
mask_usa_ser = df['USA'] > 0
mask_usa_ser

0    False
1     True
2     True
3    False
Name: USA, dtype: bool

In [17]:
# And use that Series mask on the df. This keeps only the rows where the mask is True:
df[mask_usa_ser]

        USA    Canada    Mexico   England
1  0.207077 -1.267222 -1.266063 -1.060145
2  1.176440  0.386349  0.527205  1.562789


In [18]:
# Combine conditions (use `&`, not the `and` keyword):
mask1 = df['Mexico'] > 0
mask2 = df['England'] > 0
df[(mask1 & mask2)]  # Use `|` for the "or" operator.

       USA    Canada    Mexico   England
2  1.17644  0.386349  0.527205  1.562789
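
Here is a sketch of the three element-wise boolean operators on a hypothetical frame. The parentheses around each comparison matter: `&`, `|`, and `~` bind more tightly than `>`, so leaving them out changes the meaning or raises an error.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({'Mexico': [-1.0, 2.0, 3.0], 'England': [1.0, -2.0, 3.0]})

both = df[(df['Mexico'] > 0) & (df['England'] > 0)]    # and: both conditions hold
either = df[(df['Mexico'] > 0) | (df['England'] > 0)]  # or: at least one holds
not_mexico = df[~(df['Mexico'] > 0)]                   # not: invert a mask
```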


In [19]:
# Use a column as the index. This returns a new DataFrame; df itself is unchanged:
df.set_index('USA')

             Canada    Mexico   England
USA
-0.359947  0.312600 -0.386876 -0.047347
 0.207077 -1.267222 -1.266063 -1.060145
 1.176440  0.386349  0.527205  1.562789
-1.216895 -0.740020 -0.915343 -1.956914
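
A small sketch of the round trip, using a made-up two-row frame: `set_index` hands back a new DataFrame whose rows are labeled by the chosen column, and `reset_index` turns that index back into an ordinary column.

```python
import pandas as pd

# Hypothetical frame for illustration.
df = pd.DataFrame({'USA': [1.0, 2.0], 'Canada': [3.0, 4.0]})

indexed = df.set_index('USA')         # new frame; 'USA' values are now row labels
value = indexed.loc[2.0, 'Canada']    # look up a row by its 'USA' value
restored = indexed.reset_index()      # 'USA' becomes a regular column again

# df was never modified, so it still has 'USA' as a column.
```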


In [23]:
# Grab the row at position 1. 'USA' is still a column because set_index did not modify df:
df.iloc[1]

USA        0.207077
Canada    -1.267222
Mexico    -1.266063
England   -1.060145
Name: 1, dtype: float64