In [1]:
import numpy as np

## From Wikipedia:

NumPy (pronounced /ˈnʌmpaɪ/ (NUM-py) or sometimes /ˈnʌmpi/[1][2] (NUM-pee)) is a library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. NumPy is open-source software and has many contributors.

In [2]:
import pandas as pd

## From Wikipedia:

In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.[2] The name is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals.

In [4]:
from pandas import Series, DataFrame # Optional

In [5]:
a = [1,2,3,4]
a

[1, 2, 3, 4]

In [6]:
a[1:]

[2, 3, 4]

In [8]:
obj = Series([3,6,9,12])
# If Series was not imported explicitly the syntax would be pd.Series()

In [9]:
obj

0     3
1     6
2     9
3    12
dtype: int64

In [10]:
type(obj)

pandas.core.series.Series

### A series has two main components:
1) Values
2) Indices

In [11]:
obj.values()

TypeError: 'numpy.ndarray' object is not callable

The code above raises an error because values is not a method but an attribute of the Series.
Use obj.values without parentheses instead.

In [12]:
obj.values

array([ 3,  6,  9, 12], dtype=int64)
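To make the attribute-versus-method distinction concrete, here is a minimal sketch. It rebuilds the same small Series and also uses `to_numpy()`, which is not used elsewhere in this notebook but is a regular pandas method that returns the same data.

```python
import pandas as pd

obj = pd.Series([3, 6, 9, 12])

# .values is an attribute: access it without parentheses
vals = obj.values

# .to_numpy() is a method: it must be called with parentheses
arr = obj.to_numpy()

# The attribute holds an array (not callable); the method is callable
print(callable(obj.values), callable(obj.to_numpy))
```

Calling `obj.values()` fails because Python first fetches the array and then tries to call it as if it were a function.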

In [13]:
ww2_casualty = Series([8.7e6, 4.3e6, 3.0e6, 2.1e6, 4e5], index = ["USSR", "Germany", "China", "Japan", "USA"])

In [14]:
ww2_casualty

USSR       8700000.0
Germany    4300000.0
China      3000000.0
Japan      2100000.0
USA         400000.0
dtype: float64

With labeled indices, the output now looks like a one-column data frame.

In [15]:
?Series

## However, a Series is still a one-dimensional array with labeled indices (similar to a dictionary):

Init signature: Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Docstring:     
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.

In [16]:
ww2_casualty.values

array([8700000., 4300000., 3000000., 2100000.,  400000.])

In [17]:
ww2_casualty.index

Index(['USSR', 'Germany', 'China', 'Japan', 'USA'], dtype='object')

In [18]:
ww2_casualty["USA"]

400000.0

In [20]:
ww2_casualty[ww2_casualty > 4e6] # Check which countries had casualties > 4 million

USSR       8700000.0
Germany    4300000.0
dtype: float64

In [21]:
"USSR" in ww2_casualty

True

In [22]:
"France" in ww2_casualty

False

In [23]:
ww2_cas_dict = ww2_casualty.to_dict()
ww2_cas_dict

{'China': 3000000.0,
 'Germany': 4300000.0,
 'Japan': 2100000.0,
 'USA': 400000.0,
 'USSR': 8700000.0}

In [24]:
ww2_series = Series(ww2_cas_dict)
ww2_series

China      3000000.0
Germany    4300000.0
Japan      2100000.0
USA         400000.0
USSR       8700000.0
dtype: float64

In [25]:
countries = ["China", "Germany", "Japan", "USA", "USSR", "Argentina"]

In [26]:
obj2 = Series(ww2_cas_dict, index = countries)
obj2

China        3000000.0
Germany      4300000.0
Japan        2100000.0
USA           400000.0
USSR         8700000.0
Argentina          NaN
dtype: float64

In [27]:
pd.isnull(obj2)

China        False
Germany      False
Japan        False
USA          False
USSR         False
Argentina     True
dtype: bool

In [28]:
pd.notnull(obj2)

China         True
Germany       True
Japan         True
USA           True
USSR          True
Argentina    False
dtype: bool

In [29]:
pd.isna(obj2)

China        False
Germany      False
Japan        False
USA          False
USSR         False
Argentina     True
dtype: bool
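The checks above are also available as Series methods, and `pd.isnull` is an alias of `pd.isna`. The sketch below uses a small two-element Series (a cut-down stand-in for obj2) to count missing entries and keep only the present ones.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0e6, np.nan], index=["China", "Argentina"])

# Boolean mask of missing entries, method form of pd.isna(s)
mask = s.isna()

# True counts as 1, so summing the mask counts the missing values
n_missing = int(mask.sum())

# Boolean indexing keeps only the entries that are present
present = s[s.notna()]
print(n_missing)
print(present)
```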

In [30]:
ww2_series

China      3000000.0
Germany    4300000.0
Japan      2100000.0
USA         400000.0
USSR       8700000.0
dtype: float64

In [31]:
obj2

China        3000000.0
Germany      4300000.0
Japan        2100000.0
USA           400000.0
USSR         8700000.0
Argentina          NaN
dtype: float64

In [32]:
ww2_series + obj2 # Values are added where index labels match; unmatched or missing labels give NaN

Argentina           NaN
China         6000000.0
Germany       8600000.0
Japan         4200000.0
USA            800000.0
USSR         17400000.0
dtype: float64
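Index alignment can be sketched with two tiny Series whose labels only partly overlap. The labels and numbers here are made up for illustration. The `add` method with `fill_value` is one way to treat a label missing from one side as zero instead of getting NaN.

```python
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([10, 20], index=["b", "c"])

# Plain + aligns on labels; labels present on only one side give NaN
plain = s1 + s2

# add() with fill_value substitutes 0 for a label missing on one side
filled = s1.add(s2, fill_value=0)

print(plain)
print(filled)
```

Note that `fill_value` does not help when a value is missing on both sides, which is why Argentina would still be NaN in the sum above.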

## Create a name for a series

In [33]:
obj2.name = 'World War 2 casualties'
obj2

China        3000000.0
Germany      4300000.0
Japan        2100000.0
USA           400000.0
USSR         8700000.0
Argentina          NaN
Name: World War 2 casualties, dtype: float64

## Data Frames

In [34]:
import webbrowser

In [36]:
website = "https://en.wikipedia.org/wiki/List_of_all-time_NFL_win%E2%80%93loss_records"

In [38]:
webbrowser.open(website) # Opens webpage in a browser

True

Copy the first few rows of the table on the webpage, including the header row, to the clipboard.

In [39]:
nfl_frame = pd.read_clipboard() # read directly from clipboard

In [40]:
nfl_frame

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division
0,1,Dallas Cowboys,502,374,6,0.573,1960,882,NFC East
1,2,Green Bay Packers,737,562,37,0.565,1921,1336,NFC North
2,3,Chicago Bears,749,579,42,0.562,1920,1370,NFC North
3,4,Miami Dolphins,445,351,4,0.559,1966,800,AFC East
4,5,New England Patriots[b],489,386,9,0.558,1960,884,AFC East
5,6,Minnesota Vikings,470,390,10,0.546,1961,870,NFC North
6,7,Baltimore Ravens,190,161,1,0.541,1996,352,AFC North


In [41]:
type(nfl_frame)

pandas.core.frame.DataFrame

## Data Frame operations: probably the most important part so far

In [44]:
nfl_frame.columns # Gives column names

Index(['Rank ', 'Team ', 'Won ', 'Lost ', 'Tied ', 'Pct. ',
       'First NFL Season ', 'Total Games ', 'Division'],
      dtype='object')

In [45]:
nfl_frame.Division 

0     NFC East
1    NFC North
2    NFC North
3     AFC East
4     AFC East
5    NFC North
6    AFC North
Name: Division, dtype: object

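This syntax may look odd at first. In R it would be nfl_frame$Division. pandas lets you access columns with the same dot syntax that Python uses for attributes and methods. However, this does not work if a column name contains spaces or is otherwise not a valid Python identifier; use bracket notation instead.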

In [46]:
nfl_frame.First NFL Season

SyntaxError: invalid syntax (<ipython-input-46-0b1dd6aeefbd>, line 1)

In [50]:
nfl_frame['First NFL Season ']

0    1960
1    1921
2    1920
3    1966
4    1960
5    1961
6    1996
Name: First NFL Season , dtype: int64
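The trailing spaces in the column names come from the clipboard data. One possible cleanup, sketched on a one-row stand-in frame rather than the real nfl_frame, is to strip whitespace from every column label with `columns.str.strip()`.

```python
import pandas as pd

# Stand-in frame with the same kind of trailing-space column names
df = pd.DataFrame({"Team ": ["Dallas Cowboys"], "First NFL Season ": [1960]})

# Strip leading/trailing whitespace from every column label
df.columns = df.columns.str.strip()

# Bracket access now works without the trailing space
seasons = df["First NFL Season"]

# Dot access works for labels that are valid identifiers
team = df.Team
print(list(df.columns))
```

Column names that still contain inner spaces, such as "First NFL Season", remain reachable only through bracket notation.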

In [56]:
nfl_frame[nfl_frame['Won '] > nfl_frame['Lost ']]

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division
0,1,Dallas Cowboys,502,374,6,0.573,1960,882,NFC East
1,2,Green Bay Packers,737,562,37,0.565,1921,1336,NFC North
2,3,Chicago Bears,749,579,42,0.562,1920,1370,NFC North
3,4,Miami Dolphins,445,351,4,0.559,1966,800,AFC East
4,5,New England Patriots[b],489,386,9,0.558,1960,884,AFC East
5,6,Minnesota Vikings,470,390,10,0.546,1961,870,NFC North
6,7,Baltimore Ravens,190,161,1,0.541,1996,352,AFC North


In [59]:
my_new_df = DataFrame(nfl_frame, columns = ["Team ", "First NFL Season ", "Total Games "])
my_new_df

Unnamed: 0,Team,First NFL Season,Total Games
0,Dallas Cowboys,1960,882
1,Green Bay Packers,1921,1336
2,Chicago Bears,1920,1370
3,Miami Dolphins,1966,800
4,New England Patriots[b],1960,884
5,Minnesota Vikings,1961,870
6,Baltimore Ravens,1996,352


If you specify a column name that does not exist, pandas simply creates a column with that name and fills it with NaN values.
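A minimal sketch of this behavior, using `reindex(columns=...)`, which selects columns in the same NaN-filling way. The frame is a two-row stand-in and the "Coach" column is a hypothetical name that does not exist in the data.

```python
import pandas as pd

df = pd.DataFrame({
    "Team": ["Dallas Cowboys", "Green Bay Packers"],
    "Won": [502, 737],
})

# "Coach" is not a column of df, so it is created and filled with NaN
subset = df.reindex(columns=["Team", "Coach"])
print(subset)

# Plain bracket selection is stricter: in recent pandas versions
# asking for a missing column raises KeyError instead
try:
    df[["Team", "Coach"]]
except KeyError as err:
    print("KeyError:", err)
```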

In [60]:
nfl_frame.head()

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division
0,1,Dallas Cowboys,502,374,6,0.573,1960,882,NFC East
1,2,Green Bay Packers,737,562,37,0.565,1921,1336,NFC North
2,3,Chicago Bears,749,579,42,0.562,1920,1370,NFC North
3,4,Miami Dolphins,445,351,4,0.559,1966,800,AFC East
4,5,New England Patriots[b],489,386,9,0.558,1960,884,AFC East


In [61]:
nfl_frame.tail(4)

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division
3,4,Miami Dolphins,445,351,4,0.559,1966,800,AFC East
4,5,New England Patriots[b],489,386,9,0.558,1960,884,AFC East
5,6,Minnesota Vikings,470,390,10,0.546,1961,870,NFC North
6,7,Baltimore Ravens,190,161,1,0.541,1996,352,AFC North


In [66]:
nfl_frame.iloc[3] # Access a row by its integer position

Rank                               4
Team                 Miami Dolphins 
Won                              445
Lost                             351
Tied                               4
Pct.                           0.559
First NFL Season                1966
Total Games                     800 
Division                    AFC East
Name: 3, dtype: object

In [67]:
nfl_frame.iloc[1:4] # Rows at positions 1 to 3; the end position 4 is excluded

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division
1,2,Green Bay Packers,737,562,37,0.565,1921,1336,NFC North
2,3,Chicago Bears,749,579,42,0.562,1920,1370,NFC North
3,4,Miami Dolphins,445,351,4,0.559,1966,800,AFC East


In [68]:
nfl_frame.iloc

<pandas.core.indexing._iLocIndexer at 0x260eef0>

In [69]:
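nfl_frame.loc[1:4] # Label-based slice, end label 4 is included (.ix was removed in pandas 1.0; .loc replaces it here)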

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division
1,2,Green Bay Packers,737,562,37,0.565,1921,1336,NFC North
2,3,Chicago Bears,749,579,42,0.562,1920,1370,NFC North
3,4,Miami Dolphins,445,351,4,0.559,1966,800,AFC East
4,5,New England Patriots[b],489,386,9,0.558,1960,884,AFC East
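The two slices above return different rows because positional slicing excludes the end position while label-based slicing includes the end label. A minimal sketch on a stand-in frame with the default integer index:

```python
import pandas as pd

# Stand-in frame with default integer labels 0..5
df = pd.DataFrame({"Team": list("abcdef")})

# Positional slice: positions 1, 2, 3 (end excluded, like Python lists)
by_position = df.iloc[1:4]

# Label slice: labels 1, 2, 3, 4 (end included)
by_label = df.loc[1:4]

print(len(by_position), len(by_label))
```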


In [71]:
nfl_frame["Stadium"] = "Levis_Stadium" # Creating a new column
nfl_frame

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division,Stadium
0,1,Dallas Cowboys,502,374,6,0.573,1960,882,NFC East,Levis_Stadium
1,2,Green Bay Packers,737,562,37,0.565,1921,1336,NFC North,Levis_Stadium
2,3,Chicago Bears,749,579,42,0.562,1920,1370,NFC North,Levis_Stadium
3,4,Miami Dolphins,445,351,4,0.559,1966,800,AFC East,Levis_Stadium
4,5,New England Patriots[b],489,386,9,0.558,1960,884,AFC East,Levis_Stadium
5,6,Minnesota Vikings,470,390,10,0.546,1961,870,NFC North,Levis_Stadium
6,7,Baltimore Ravens,190,161,1,0.541,1996,352,AFC North,Levis_Stadium


In [72]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [74]:
nfl_frame['Stadiums_Number'] = np.arange(7)

In [75]:
nfl_frame

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division,Stadium,Stadiums_Number
0,1,Dallas Cowboys,502,374,6,0.573,1960,882,NFC East,Levis_Stadium,0
1,2,Green Bay Packers,737,562,37,0.565,1921,1336,NFC North,Levis_Stadium,1
2,3,Chicago Bears,749,579,42,0.562,1920,1370,NFC North,Levis_Stadium,2
3,4,Miami Dolphins,445,351,4,0.559,1966,800,AFC East,Levis_Stadium,3
4,5,New England Patriots[b],489,386,9,0.558,1960,884,AFC East,Levis_Stadium,4
5,6,Minnesota Vikings,470,390,10,0.546,1961,870,NFC North,Levis_Stadium,5
6,7,Baltimore Ravens,190,161,1,0.541,1996,352,AFC North,Levis_Stadium,6


In [76]:
nfl_frame['Total_Matches'] = nfl_frame['Won '] + nfl_frame['Lost ']

In [77]:
nfl_frame

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division,Stadium,Stadiums_Number,Total_Matches
0,1,Dallas Cowboys,502,374,6,0.573,1960,882,NFC East,Levis_Stadium,0,876
1,2,Green Bay Packers,737,562,37,0.565,1921,1336,NFC North,Levis_Stadium,1,1299
2,3,Chicago Bears,749,579,42,0.562,1920,1370,NFC North,Levis_Stadium,2,1328
3,4,Miami Dolphins,445,351,4,0.559,1966,800,AFC East,Levis_Stadium,3,796
4,5,New England Patriots[b],489,386,9,0.558,1960,884,AFC East,Levis_Stadium,4,875
5,6,Minnesota Vikings,470,390,10,0.546,1961,870,NFC North,Levis_Stadium,5,860
6,7,Baltimore Ravens,190,161,1,0.541,1996,352,AFC North,Levis_Stadium,6,351


In [80]:
stadiums = Series(["Levi's Stadium", "AT&T Stadium"], index = [4,0])
stadiums

4    Levi's Stadium
0      AT&T Stadium
dtype: object

In [81]:
nfl_frame["Stadium"] = stadiums
nfl_frame

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division,Stadium,Stadiums_Number,Total_Matches
0,1,Dallas Cowboys,502,374,6,0.573,1960,882,NFC East,AT&T Stadium,0,876
1,2,Green Bay Packers,737,562,37,0.565,1921,1336,NFC North,,1,1299
2,3,Chicago Bears,749,579,42,0.562,1920,1370,NFC North,,2,1328
3,4,Miami Dolphins,445,351,4,0.559,1966,800,AFC East,,3,796
4,5,New England Patriots[b],489,386,9,0.558,1960,884,AFC East,Levi's Stadium,4,875
5,6,Minnesota Vikings,470,390,10,0.546,1961,870,NFC North,,5,860
6,7,Baltimore Ravens,190,161,1,0.541,1996,352,AFC North,,6,351
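Assigning a Series to a column aligns it on the index labels, not on position, which is why only rows 0 and 4 received a stadium. A small sketch with made-up team and stadium names:

```python
import pandas as pd

df = pd.DataFrame({"Team": ["A", "B", "C"]})

# The Series is deliberately given out of order and without label 1
stadiums = pd.Series(["Stadium C", "Stadium A"], index=[2, 0])

# Values land on the rows whose labels match; other rows get NaN
df["Stadium"] = stadiums
print(df)
```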


Delete a column

In [82]:
del nfl_frame['Stadiums_Number']

In [83]:
nfl_frame

Unnamed: 0,Rank,Team,Won,Lost,Tied,Pct.,First NFL Season,Total Games,Division,Stadium,Total_Matches
0,1,Dallas Cowboys,502,374,6,0.573,1960,882,NFC East,AT&T Stadium,876
1,2,Green Bay Packers,737,562,37,0.565,1921,1336,NFC North,,1299
2,3,Chicago Bears,749,579,42,0.562,1920,1370,NFC North,,1328
3,4,Miami Dolphins,445,351,4,0.559,1966,800,AFC East,,796
4,5,New England Patriots[b],489,386,9,0.558,1960,884,AFC East,Levi's Stadium,875
5,6,Minnesota Vikings,470,390,10,0.546,1961,870,NFC North,,860
6,7,Baltimore Ravens,190,161,1,0.541,1996,352,AFC North,,351


In [85]:
old_dict = {"SF":8.37e5, "LA":3.88e6, "NYC":8.4e6}
old_dict

{'LA': 3880000.0, 'NYC': 8400000.0, 'SF': 837000.0}

In [86]:
data_dict = {"City":["SF", "LA", "NYC"], "Population":[8.37e5, 3.88e6, 8.4e6]}
data_dict

{'City': ['SF', 'LA', 'NYC'], 'Population': [837000.0, 3880000.0, 8400000.0]}

In [87]:
city_frame = DataFrame(data_dict)
city_frame

Unnamed: 0,City,Population
0,SF,837000.0
1,LA,3880000.0
2,NYC,8400000.0


In [88]:
city_Series = Series(data_dict)
city_Series

City                             [SF, LA, NYC]
Population    [837000.0, 3880000.0, 8400000.0]
dtype: object

## Index Objects

In [89]:
my_ser = Series([1,2,3,4], index = ['A', 'B', 'C','D'])
my_ser

A    1
B    2
C    3
D    4
dtype: int64

In [90]:
my_index = my_ser.index
my_index

Index(['A', 'B', 'C', 'D'], dtype='object')

In [91]:
type(my_index)

pandas.core.indexes.base.Index

In [92]:
my_index[2]

'C'

In [94]:
my_index[2:]

Index(['C', 'D'], dtype='object')

In [96]:
my_index[0] = "Z"
# Index objects are immutable

TypeError: Index does not support mutable operations
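Because an Index cannot be modified in place, labels are changed by building a new object. The sketch below catches the TypeError and then uses `rename`, which is one of several possible ways to relabel and which returns a new Series.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# In-place assignment on the Index raises TypeError
try:
    s.index[0] = "Z"
    mutated = True
except TypeError:
    mutated = False

# rename() returns a new Series with the relabeled index
renamed = s.rename(index={"A": "Z"})
print(mutated)
print(renamed.index)
```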

## Reindexing

In [97]:
from numpy.random import randn
# randn draws random samples from the standard normal distribution

In [113]:
ser1 = Series([1,2,3,4], index = ['A', 'B', 'C','D'])
ser1

A    1
B    2
C    3
D    4
dtype: int64

In [114]:
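ser2 = ser1.reindex(['A','B','C','D','E','F']) # Definition inferred from the output below: new labels E and F get NaN
ser2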

A    1.0
B    2.0
C    3.0
D    4.0
E    NaN
F    NaN
dtype: float64

In [115]:
ser2.reindex(['A','B','C','D','E','F','G'],fill_value = 253)

A      1.0
B      2.0
C      3.0
D      4.0
E      NaN
F      NaN
G    253.0
dtype: float64
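In the output above, only G receives 253 while E and F stay NaN. `fill_value` applies only to labels that are new in that reindex call; NaN values that already exist in the Series are left alone. A minimal sketch, assuming ser2 was built by reindexing ser1 (its definition is not shown in the notebook):

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=["A", "B", "C", "D"])

# Assumed construction of ser2: E and F are new labels, so they get NaN
ser2 = ser1.reindex(["A", "B", "C", "D", "E", "F"])

# Only G is new here, so only G gets the fill value
ser_g = ser2.reindex(["A", "B", "C", "D", "E", "F", "G"], fill_value=253)

# fillna() is the tool for replacing NaN values that already exist
ser_filled = ser_g.fillna(0)

print(ser_g)
print(ser_filled)
```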

In [116]:
ser3 = Series(['USA', 'Mexico', 'Canada'], index = [0,5,10])
ser3

0        USA
5     Mexico
10    Canada
dtype: object

In [117]:
ranger = np.arange(15)
ranger

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [118]:
ser3.reindex(ranger)

0        USA
1        NaN
2        NaN
3        NaN
4        NaN
5     Mexico
6        NaN
7        NaN
8        NaN
9        NaN
10    Canada
11       NaN
12       NaN
13       NaN
14       NaN
dtype: object

In [122]:
ser3.reindex(ranger,method='ffill') # ffill means forward fill

0        USA
1        USA
2        USA
3        USA
4        USA
5     Mexico
6     Mexico
7     Mexico
8     Mexico
9     Mexico
10    Canada
11    Canada
12    Canada
13    Canada
14    Canada
dtype: object

In [123]:
ser3 # Original series remains unchanged

0        USA
5     Mexico
10    Canada
dtype: object

In [124]:
ser3.reindex(ranger,method='bfill')

0        USA
1     Mexico
2     Mexico
3     Mexico
4     Mexico
5     Mexico
6     Canada
7     Canada
8     Canada
9     Canada
10    Canada
11       NaN
12       NaN
13       NaN
14       NaN
dtype: object
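Forward fill copies the last known value forward, and backward fill copies the next known value backward, so labels after the last known value stay NaN under bfill. The sketch below uses a small numeric Series (made-up values) and compares `reindex(method="ffill")` with the two-step form `reindex(...).ffill()`.

```python
import pandas as pd

s = pd.Series([10.0, 20.0], index=[0, 3])

# Forward fill while reindexing: each gap takes the previous known value
forward = s.reindex(range(5), method="ffill")

# Two-step form: reindex first (gaps become NaN), then fill forward
two_step = s.reindex(range(5)).ffill()

# Backward fill: each gap takes the next known value; label 4 has none
backward = s.reindex(range(5), method="bfill")

print(forward.tolist())
print(backward.tolist())
```

The two forms agree here, but they can differ when the original Series already contains NaN values, since `method=` only fills labels introduced by the reindex.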

In [125]:
randn(5)

array([ 0.51573919, -0.46967725,  2.38527831,  0.64677897,  0.02772831])

In [126]:
randn(25).reshape(5,5) # reshape is a method of the array, so it is called with parentheses

array([[ 0.07036027,  1.02584289,  1.13845041,  0.08700253,  0.48042082],
       [ 1.0687839 ,  2.52710381, -1.67602422,  0.75876968,  0.50159634],
       [-1.66468448,  0.03053313, -0.80049472, -0.97117358,  1.09027849],
       [-1.41745253, -0.09666368, -0.16028942, -0.46611173, -0.15740246],
       [-2.65322806, -0.37392309, -0.44690049,  0.81935911, -0.75096421]])

In [129]:
dframe = DataFrame(randn(25).reshape(5,5), index = ['A','B','C','D','E'], columns = ['col1','col2','col3','col4','col5'])

In [130]:
dframe

Unnamed: 0,col1,col2,col3,col4,col5
A,1.197888,0.036431,1.058139,0.554925,0.171594
B,-0.854241,-0.241092,-1.30375,-2.181057,0.091809
C,0.584431,0.716713,0.643597,-1.297821,1.844025
D,0.623463,0.869859,0.574391,-0.252925,0.710854
E,-0.840069,-2.282255,-0.77501,0.272248,1.519339


## So this is how you set row and column names: index holds the row names and columns holds the column names

In [131]:
dframe.columns

Index(['col1', 'col2', 'col3', 'col4', 'col5'], dtype='object')

In [132]:
dframe.index

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

In [133]:
dframe2 = dframe.reindex(['A','B','C','D','E','F'])

In [134]:
dframe2 # The new row F is filled entirely with NaN

Unnamed: 0,col1,col2,col3,col4,col5
A,1.197888,0.036431,1.058139,0.554925,0.171594
B,-0.854241,-0.241092,-1.30375,-2.181057,0.091809
C,0.584431,0.716713,0.643597,-1.297821,1.844025
D,0.623463,0.869859,0.574391,-0.252925,0.710854
E,-0.840069,-2.282255,-0.77501,0.272248,1.519339
F,,,,,


In [135]:
new_columns = ['col1','col2','col3','col4','col5','col6']
dframe2.reindex(columns = new_columns)

Unnamed: 0,col1,col2,col3,col4,col5,col6
A,1.197888,0.036431,1.058139,0.554925,0.171594,
B,-0.854241,-0.241092,-1.30375,-2.181057,0.091809,
C,0.584431,0.716713,0.643597,-1.297821,1.844025,
D,0.623463,0.869859,0.574391,-0.252925,0.710854,
E,-0.840069,-2.282255,-0.77501,0.272248,1.519339,
F,,,,,,


Here the new column col6 is filled entirely with NaN.

## Index Hierarchy

In [136]:
ser = Series(randn(6), index = [[1,1,1,2,2,2],['a','b','c','a','b','c']])
ser

1  a   -0.894945
   b    2.270272
   c   -0.616018
2  a   -0.095100
   b    2.471409
   c   -0.280064
dtype: float64

In [137]:
ser.index

MultiIndex(levels=[[1, 2], ['a', 'b', 'c']],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

In [139]:
ser[1] # Select all entries whose outer (first-level) label is 1

a   -0.894945
b    2.270272
c   -0.616018
dtype: float64

In [140]:
ser[2]

a   -0.095100
b    2.471409
c   -0.280064
dtype: float64
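Beyond selecting by the outer level, a hierarchical index also allows selection by the inner level and reshaping into a table. The sketch below uses fixed integers instead of random numbers so the results are easy to follow; `loc`, `xs`, and `unstack` are standard pandas methods that are not otherwise used in this notebook.

```python
import pandas as pd

ser = pd.Series(
    [10, 20, 30, 40, 50, 60],
    index=[[1, 1, 1, 2, 2, 2], ["a", "b", "c", "a", "b", "c"]],
)

# Outer-level selection, label-based
outer = ser.loc[1]

# A tuple picks one element by (outer, inner) labels
single = ser.loc[(1, "b")]

# Cross-section on the inner level: every 'a' across outer labels
inner = ser.xs("a", level=1)

# unstack() pivots the inner level into columns, giving a DataFrame
wide = ser.unstack()
print(wide)
```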