# Section 5: Intro to Pandas



In [1]:
import pandas as pd
import numpy as np

## Pandas Series

Example: analysis of the G7 (Group of Seven), a political group of countries
- a Series is an ordered, indexed sequence of elements
- looks a lot like a Python list, but there are important differences
    - has an associated data type (e.g. float64)
    - is backed by an underlying numpy array
- more similar to a numpy array
    - can select elements as you would in an array
- the index can be defined with strings
    - so it actually looks more like a dictionary, but it is ordered
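
The comparison above can be sketched in a few lines. This is a minimal illustration, not part of the original notebook; the values and the labels `'a'`, `'b'`, `'c'` are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical example values and labels (not from the notes).
s = pd.Series([1.5, 2.5, 4.0], index=['a', 'b', 'c'])

# Like a numpy array: one data type for all elements, backed by an ndarray.
dtype_name = str(s.dtype)
is_ndarray = isinstance(s.to_numpy(), np.ndarray)

# Like a dictionary: look up a value by its label.
by_label = s['b']

# Like a list or array: look up a value by its position.
by_position = s.iloc[1]

# Unlike a list: arithmetic is applied element-wise.
doubled = s * 2
print(doubled)
```

Here `s['b']` and `s.iloc[1]` return the same element, once by label and once by position.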


In [3]:
#a series storing the population of the 7 countries, in millions
g7_pop = pd.Series([35.467, 63.951, 80.940, 60.665, 127.061, 64.511, 318.523])
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
dtype: float64

In [4]:
#can give it a name
g7_pop.name = 'G7 Population in millions'
g7_pop

0     35.467
1     63.951
2     80.940
3     60.665
4    127.061
5     64.511
6    318.523
Name: G7 Population in millions, dtype: float64

In [6]:
#data type of the elements (float64)
g7_pop.dtype

dtype('float64')

In [7]:
#returns the values as a numpy array
g7_pop.values

array([ 35.467,  63.951,  80.94 ,  60.665, 127.061,  64.511, 318.523])

In [8]:
#selecting an individual element (the index is still the default 0..6 here)
g7_pop[0]

35.467

In [9]:
#finding out index structure
g7_pop.index

RangeIndex(start=0, stop=7, step=1)

In [10]:
#replacing the default index with country names
g7_pop.index = [
    'Canada',
    'France',
    'Germany',
    'Italy',
    'Japan',
    'United Kingdom',
    'United States',
]
g7_pop

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [11]:
#can do all of this at once by passing a dictionary and a name
pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.94,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523
}, name='G7 Population in millions')

Canada             35.467
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64

In [12]:
#create a series out of another series; labels missing from the original (Spain) become NaN
pd.Series(g7_pop, index=['France', 'Germany', 'Italy', 'Spain'])

France     63.951
Germany    80.940
Italy      60.665
Spain         NaN
Name: G7 Population in millions, dtype: float64
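
As a small sketch of what happens in the cell above: passing an existing Series together with an `index` selects the matching labels and fills the rest with NaN, which should give the same result as calling `reindex`. The series `pop` below is a shortened copy of the data used for illustration.

```python
import pandas as pd

# Shortened copy of the population data, for illustration only.
pop = pd.Series({'France': 63.951, 'Germany': 80.940, 'Italy': 60.665})

# Building a new Series from an existing one with a new index...
subset = pd.Series(pop, index=['France', 'Spain'])

# ...matches the labels, just like reindex does.
same = pop.reindex(['France', 'Spain'])

# Labels that were not in the original series hold NaN.
missing = subset.isna()
print(subset)
print(missing)
```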

### Indexing and Slicing

In [13]:
#same syntax as a python dictionary to access a specific value
g7_pop['Canada']

35.467

In [14]:
#multiple elements at once
g7_pop[['Italy', 'France']]

Italy     60.665
France    63.951
Name: G7 Population in millions, dtype: float64

In [15]:
#can still select by position with iloc, like the first element
g7_pop.iloc[0]

35.467

In [16]:
#or the last element; iloc also accepts a list of positions
g7_pop.iloc[-1]

318.523

In [18]:
#label-based slicing in pandas includes the upper limit (unlike python and iloc slicing)
g7_pop['Canada': 'Italy']

Canada     35.467
France     63.951
Germany    80.940
Italy      60.665
Name: G7 Population in millions, dtype: float64
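
A short sketch of the difference between the two kinds of slicing, using a shortened copy of the data: slicing by label includes the end label, while slicing by position with `iloc` excludes the end position, as in plain Python.

```python
import pandas as pd

# Shortened copy of the population data, for illustration only.
pop = pd.Series(
    [35.467, 63.951, 80.940, 60.665],
    index=['Canada', 'France', 'Germany', 'Italy'],
)

# Label-based slice: 'Germany' (the upper limit) is included.
by_label = pop.loc['Canada':'Germany']

# Position-based slice: position 2 (the upper limit) is excluded.
by_position = pop.iloc[0:2]

print(by_label)
print(by_position)
```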

### Conditional selection (boolean series)

In [23]:
#returns a boolean series
g7_pop > 70

Canada            False
France            False
Germany            True
Italy             False
Japan              True
United Kingdom    False
United States      True
Name: G7 Population in millions, dtype: bool

In [20]:
#filtering the series with a boolean mask
g7_pop[g7_pop > 70]

Germany           80.940
Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

In [22]:
#to get the answer in people rather than millions of people
g7_pop * 1000000

Canada             35467000.0
France             63951000.0
Germany            80940000.0
Italy              60665000.0
Japan             127061000.0
United Kingdom     64511000.0
United States     318523000.0
Name: G7 Population in millions, dtype: float64

In [24]:
g7_pop.mean()

107.30257142857144

In [25]:
g7_pop[g7_pop > g7_pop.mean()]

Japan            127.061
United States    318.523
Name: G7 Population in millions, dtype: float64

In [26]:
g7_pop.std()

97.24996987121581

Boolean operators for combining conditions: `~` (not), `|` (or), `&` (and). Wrap each condition in parentheses.

In [27]:
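#two conditions combined with | (or); the second implies the first, so only the first decides the result
g7_pop[(g7_pop > g7_pop.mean() - g7_pop.std() / 2) | (g7_pop > g7_pop.mean() + g7_pop.std() / 2)]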

France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
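
To see each of the three operators on its own, here is a minimal sketch on a fresh copy of the data. The thresholds (60, 100, 70, 40, 300) are arbitrary values picked for the example.

```python
import pandas as pd

# Fresh copy of the population data (in millions).
pop = pd.Series({
    'Canada': 35.467,
    'France': 63.951,
    'Germany': 80.940,
    'Italy': 60.665,
    'Japan': 127.061,
    'United Kingdom': 64.511,
    'United States': 318.523,
})

# & (and): both conditions must hold.
mid_sized = pop[(pop > 60) & (pop < 100)]

# | (or): at least one condition must hold.
extremes = pop[(pop < 40) | (pop > 300)]

# ~ (not): inverts a boolean series.
not_large = pop[~(pop > 70)]

print(mid_sized)
print(extremes)
print(not_large)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.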

### Operations and methods

In [29]:
#besides mean, std, etc., regular numpy math functions also work
np.log(g7_pop)

Canada            3.568603
France            4.158117
Germany           4.393708
Italy             4.105367
Japan             4.844667
United Kingdom    4.166836
United States     5.763695
Name: G7 Population in millions, dtype: float64
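
As a small check of the behavior shown above, the sketch below applies numpy functions to a shortened copy of the data. The result should again be a Series with the same index, computed element by element.

```python
import numpy as np
import pandas as pd

# Shortened copy of the population data, for illustration only.
pop = pd.Series({'Canada': 35.467, 'France': 63.951, 'Germany': 80.940})

# numpy functions are applied to every element and keep the index.
logged = np.log(pop)

# Applying the inverse function recovers the original values.
restored = np.exp(logged)

print(logged)
print(restored)
```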

### Modifying series

In [30]:
g7_pop['Canada'] = 40.5

In [31]:
g7_pop

Canada             40.500
France             63.951
Germany            80.940
Italy              60.665
Japan             127.061
United Kingdom     64.511
United States     318.523
Name: G7 Population in millions, dtype: float64
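
Assignment is not limited to a single label. As a sketch, assuming the same selection styles used earlier, values can also be changed by position with `iloc` or for every element matching a boolean mask. The replacement values 500.0 and 99.99 are arbitrary.

```python
import pandas as pd

# Shortened copy of the population data, for illustration only.
pop = pd.Series({'Canada': 35.467, 'France': 63.951, 'Germany': 80.940})

# Modify by position: the last element.
pop.iloc[-1] = 500.0

# Modify every element that matches a condition.
pop[pop < 70] = 99.99

print(pop)
```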