# **Spotlight on Pandas**
By Dongwei Qi

**Introduction to pandas**

Pandas is a Python library for data manipulation and analysis. It offers data structures and operations for manipulating numerical tables and time series. Its central structure, the DataFrame, is widely used to prepare data for machine learning, and pandas can import data from various file formats such as CSV and Excel. It supports data manipulation operations such as groupby, join, merge, melt, and concatenation, as well as data cleaning features such as filling, replacing, or imputing null values.

In this spotlight on pandas, I will show you how to manipulate the two basic data structures of pandas: Series and DataFrame. I will use the MovieLens dataset as an example of operations using pandas.

## **1. Basic data structure with pandas**

Let's start with the fundamental data structures in pandas.


In [0]:
import numpy as np
import pandas as pd

### **1.1 Series**

**Series** is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is to call:

*s = pd.Series(data, index=index)*

where data holds values of the types mentioned above (for example as an ndarray, a list, or a dict) and the passed index is a list of axis labels. When data is an ndarray or a list, the index must be the same length as data. If no index is passed, one will be created with the values [0, ..., len(data) - 1].


In [111]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(s.index)

a   -1.142993
b   -0.017482
c   -0.092377
d   -0.191771
e   -1.238829
dtype: float64
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')


In [112]:
s=pd.Series(np.random.randn(5))
print(s)
print(s.index)

0    0.288414
1    0.074328
2   -0.107528
3   -1.114890
4    0.226366
dtype: float64
RangeIndex(start=0, stop=5, step=1)


We can also instantiate a series from dicts:

In [113]:
d = {'b': 1, 'a': 0, 'c': 2}
print(pd.Series(d))

b    1
a    0
c    2
dtype: int64


If an index is passed, the values in data corresponding to the labels in the index will be pulled out.



In [114]:
print(pd.Series(d, index=['b', 'c', 'd', 'a']))

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64


**NaN** (not a number) is the standard missing data marker used in pandas.




If you need the actual array backing a Series, use Series.array.

While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().


In [115]:
print(s.array)
print(s.to_numpy())

<PandasArray>
[ 0.28841414838923907,  0.07432782642052534, -0.10752752418629212,
  -1.1148895492180102,  0.22636634886579018]
Length: 5, dtype: float64
[ 0.28841415  0.07432783 -0.10752752 -1.11488955  0.22636635]


A Series is like a fixed-size dict in that you can get and set values by index label:



In [116]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(s['a'])
print('f' in s)

a   -0.430276
b   -1.142740
c    0.380328
d   -0.236932
e   -0.695489
dtype: float64
-0.4302757086851164
False
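
The cell above reads a value by label; the sketch below also sets one. The Series name and its values are made up for illustration, and get() is shown as the dict-style way to supply a default for a missing label.

```python
import numpy as np
import pandas as pd

# Hypothetical values chosen so the results are easy to check by eye.
prices = pd.Series([1.5, 2.0, 3.25], index=['a', 'b', 'c'])

# Set the value stored under an existing label, dict-style.
prices['a'] = 10.0

# Indexing with a missing label raises KeyError;
# get() returns the supplied default instead.
missing = prices.get('z', np.nan)

print(prices)
print(missing)
```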


When working with raw NumPy arrays, looping through value-by-value is usually not necessary. The same is true when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.


In [117]:
print(s+s)

a   -0.860551
b   -2.285481
c    0.760656
d   -0.473864
e   -1.390978
dtype: float64


In [118]:
print(s*2)

a   -0.860551
b   -2.285481
c    0.760656
d   -0.473864
e   -1.390978
dtype: float64
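
As a small sketch of passing a Series to a NumPy function, the example below uses fixed, made-up values instead of random ones. It also shows one point worth keeping in mind: arithmetic between two Series matches values by label rather than by position.

```python
import numpy as np
import pandas as pd

s = pd.Series([0.0, 1.0, 4.0, 9.0], index=['a', 'b', 'c', 'd'])

# NumPy ufuncs apply element-wise and return a Series with the same index.
roots = np.sqrt(s)

# Arithmetic aligns on labels: 'a' and 'd' each appear in only one
# operand, so the result is NaN for those labels.
shifted = s.iloc[1:] + s.iloc[:-1]

print(roots)
print(shifted)
```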


### **1.2 DataFrame**

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

- Dict of 1D ndarrays, lists, dicts, or Series
- 2-D numpy.ndarray
- Structured or record ndarray
- A Series
- Another DataFrame


In [119]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df=pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [120]:
print(df.index)
print(df.columns)

Index(['a', 'b', 'c', 'd'], dtype='object')
Index(['one', 'two'], dtype='object')



Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict of Series plus a specific index will discard all data not matching up to the passed index.

If axis labels are not passed, they will be constructed from the input data based on common sense rules.


In [121]:
pd.DataFrame(d, index=['d', 'b', 'a'])

Unnamed: 0,one,two
d,,4.0
b,2.0,2.0
a,1.0,1.0


In [122]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

Unnamed: 0,two,three
d,4.0,
b,2.0,
a,1.0,


We can construct a DataFrame from a list of dicts. The dict keys become the column labels.

The index and columns can also be specified explicitly.


In [123]:
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
pd.DataFrame(data)

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


In [124]:
pd.DataFrame(data, index=['first', 'second'])

Unnamed: 0,a,b,c
first,1,2,
second,5,10,20.0


In [125]:
pd.DataFrame(data, columns=['a', 'b'])

Unnamed: 0,a,b
0,1,2
1,5,10


In [126]:
pd.DataFrame(data, columns=['b', 'a','d'])

Unnamed: 0,b,a,d
0,2,1,
1,10,5,


DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter which is 'columns' by default, but which can be set to 'index' in order to use the dict keys as row labels.



In [127]:
pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


If you pass orient='index', the keys will be the row labels. In this case, you can also pass the desired column names:



In [128]:
pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),
                       orient='index', columns=['one', 'two', 'three'])


Unnamed: 0,one,two,three
A,1,2,3
B,4,5,6


You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:



In [129]:
d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df=pd.DataFrame(d)
df


Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0


In [130]:
df['three'] = df['one'] * df['two']
df

Unnamed: 0,one,two,three
a,1.0,1.0,1.0
b,2.0,2.0,4.0
c,3.0,3.0,9.0
d,,4.0,


In [131]:
del df['two']
three = df.pop('three')
df

Unnamed: 0,one
a,1.0
b,2.0
c,3.0
d,


In [132]:
three

a    1.0
b    4.0
c    9.0
d    NaN
Name: three, dtype: float64

When inserting a scalar value, it will naturally be propagated to fill the column:



In [133]:
df['flag'] = df['one'] > 2
df['foo'] = 'bar'
df

Unnamed: 0,one,flag,foo
a,1.0,False,bar
b,2.0,False,bar
c,3.0,True,bar
d,,False,bar


We can select a row by label using **df.loc[label]** or by integer location using **df.iloc[loc]**:





In [134]:
df.iloc[1]

one         2
flag    False
foo       bar
Name: b, dtype: object

In [135]:
df.loc['b']

one         2
flag    False
foo       bar
Name: b, dtype: object
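
Both indexers also accept slices and a column selector. The sketch below uses a small made-up frame to show the main difference: label slices include both endpoints, while integer slices exclude the stop position.

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0, 4.0],
                   'foo': ['w', 'x', 'y', 'z']},
                  index=['a', 'b', 'c', 'd'])

# Label slices include both endpoints: rows 'b' and 'c'.
by_label = df.loc['b':'c', 'one']

# Integer slices follow Python rules and exclude the stop: rows 1 and 2.
by_position = df.iloc[1:3, 0]

# A single cell, addressed by row label and column label.
cell = df.loc['d', 'foo']

print(by_label)
print(cell)
```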

To **transpose**, access the T attribute (or call the transpose() method):



In [136]:
df.T

Unnamed: 0,a,b,c,d
one,1,2,3,
flag,False,False,True,False
foo,bar,bar,bar,bar


Very large DataFrames will be truncated to display them in the console. You can also get a summary using info(). 

I will read in the tags.csv file from the MovieLens dataset as an example.




In [137]:
csv_file='/content/drive/My Drive/Colab Notebooks/ml-latest-small/tags.csv'
tags=pd.read_csv(csv_file)
print(tags)

      userId  movieId               tag   timestamp
0          2    60756             funny  1445714994
1          2    60756   Highly quotable  1445714996
2          2    60756      will ferrell  1445714992
3          2    89774      Boxing story  1445715207
4          2    89774               MMA  1445715200
...      ...      ...               ...         ...
3678     606     7382         for katie  1171234019
3679     606     7936           austere  1173392334
3680     610     3265            gun fu  1493843984
3681     610     3265  heroic bloodshed  1493843978
3682     610   168248  Heroic Bloodshed  1493844270

[3683 rows x 4 columns]


In [138]:
tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3683 entries, 0 to 3682
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   userId     3683 non-null   int64 
 1   movieId    3683 non-null   int64 
 2   tag        3683 non-null   object
 3   timestamp  3683 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 115.2+ KB


## **2. Data Manipulation**

*In this part, we will see how to use pandas to manipulate data, how pandas deals with missing data, and how to perform operations with fill values.*

### **2.1 Missing values**



In [139]:
tags

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


In [140]:
tags.head(2)

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996


In [141]:
tags.tail(3)

Unnamed: 0,userId,movieId,tag,timestamp
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978
3682,610,168248,Heroic Bloodshed,1493844270


We can sort the DataFrame by its index or by the values of any column:

In [142]:
tags.sort_values(by='movieId',ascending=True)

Unnamed: 0,userId,movieId,tag,timestamp
2886,567,1,fun,1525286013
981,474,1,pixar,1137206825
629,336,1,pixar,1139045764
35,62,2,Robin Williams,1528843907
34,62,2,magic board game,1528843932
...,...,...,...,...
402,62,187595,star wars,1528934552
528,184,193565,comedy,1537098587
527,184,193565,anime,1537098582
530,184,193565,remaster,1537098592
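
The cell above sorts by a single column. As a sketch of the other options, the example below sorts a small made-up frame (standing in for the tags data) by index and by several columns at once.

```python
import pandas as pd

# Hypothetical rows; the index values are arbitrary.
sample = pd.DataFrame(
    {'movieId': [1, 3, 2, 1],
     'tag': ['pixar', 'old', 'fantasy', 'fun']},
    index=[10, 30, 20, 40])

# Sort rows by index label, largest first.
by_index = sample.sort_index(ascending=False)

# Sort by movieId, then by tag to break ties.
by_columns = sample.sort_values(by=['movieId', 'tag'])

print(by_index)
print(by_columns)
```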


We can select rows based on the values of a single column, much like a WHERE clause in SQL:




In [143]:
tags[tags['movieId']<10]

Unnamed: 0,userId,movieId,tag,timestamp
33,62,2,fantasy,1528843929
34,62,2,magic board game,1528843932
35,62,2,Robin Williams,1528843907
561,289,3,moldy,1143424860
562,289,3,old,1143424860
629,336,1,pixar,1139045764
981,474,1,pixar,1137206825
982,474,2,game,1137375552
983,474,5,pregnancy,1137373903
984,474,5,remake,1137373903
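
Several conditions can be combined in one filter. The sketch below uses a few made-up rows shaped like the tags data and shows two equivalent spellings: a boolean mask and the query() method.

```python
import pandas as pd

sample = pd.DataFrame(
    {'userId': [62, 62, 289, 474, 474],
     'movieId': [2, 7153, 3, 1, 5],
     'tag': ['fantasy', 'fantasy', 'old', 'pixar', 'remake']})

# Combine conditions with & (and) or | (or). Each condition needs its
# own parentheses because & binds more tightly than < and ==.
mask = (sample['movieId'] < 10) & (sample['userId'] == 474)
by_mask = sample[mask]

# query() expresses the same filter as a string.
by_query = sample.query('movieId < 10 and userId == 474')

print(by_mask)
```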


Using the isin() method for filtering:

In [144]:
tags[tags['tag'].isin(['magic board game','fantasy','fun'])]

Unnamed: 0,userId,movieId,tag,timestamp
33,62,2,fantasy,1528843929
34,62,2,magic board game,1528843932
106,62,7153,fantasy,1528152556
196,62,59501,fantasy,1525637507
869,424,4993,fantasy,1457901156
957,424,106489,fantasy,1457901210
2776,509,80834,fantasy,1435992979
2886,567,1,fun,1525286013
3166,567,89745,fun,1525285537
3203,567,108932,fun,1525283672


pandas primarily uses the value np.nan to represent missing data. By default, it is excluded from descriptive computations such as sum() and mean().


In [145]:
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
df['four'] = 'bar'
df['five'] = df['one'] > 0
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df2

Unnamed: 0,one,two,three,four,five
a,-0.629893,1.294311,1.052424,bar,False
b,,,,,
c,0.069158,-0.850715,-0.409728,bar,True
d,,,,,
e,-0.388401,1.054568,-0.261238,bar,False
f,0.287233,0.037376,0.091357,bar,True
g,,,,,
h,0.172148,-0.921536,-2.501539,bar,True


In [146]:
df2['one']

a   -0.629893
b         NaN
c    0.069158
d         NaN
e   -0.388401
f    0.287233
g         NaN
h    0.172148
Name: one, dtype: float64

To make detecting missing values easier (and across different array dtypes), pandas provides the isna() and notna() functions, which are also methods on Series and DataFrame objects:



In [147]:
pd.isna(df2['one'])

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [148]:
df2['one'].notna()

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool

In [149]:
df2.isna()

Unnamed: 0,one,two,three,four,five
a,False,False,False,False,False
b,True,True,True,True,True
c,False,False,False,False,False
d,True,True,True,True,True
e,False,False,False,False,False
f,False,False,False,False,False
g,True,True,True,True,True
h,False,False,False,False,False


In [150]:
a = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two', 'three'])
a['three']=np.nan
a

Unnamed: 0,one,two,three
a,0.599753,0.299737,
c,-1.703983,-0.9131,
e,-0.457876,0.226017,
f,-1.491071,-0.558886,
h,-0.154529,0.952042,


In [151]:
b = pd.DataFrame(np.random.randn(5, 2), index=['a', 'c', 'e', 'f', 'h'],
                  columns=['one', 'two'])

b.loc[ ['a','c','f'],['one','two']]=np.nan
b

Unnamed: 0,one,two
a,,
c,,
e,0.130947,0.949652
f,,
h,0.853039,0.233902


In [152]:
a+b

Unnamed: 0,one,three,two
a,,,
c,,,
e,-0.326929,,1.175669
f,,,
h,0.69851,,1.185944


We can see that the missing values propagate naturally through arithmetic operations between pandas objects.
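
For operations with fill values, the arithmetic methods such as add() accept a fill_value argument. The sketch below uses two small made-up frames; the value 0 is just one reasonable choice of fill value for addition.

```python
import numpy as np
import pandas as pd

x = pd.DataFrame({'one': [1.0, np.nan], 'two': [3.0, 4.0]},
                 index=['a', 'b'])
y = pd.DataFrame({'one': [10.0, 20.0], 'two': [np.nan, np.nan]},
                 index=['a', 'b'])

# Plain addition: a NaN in either operand gives NaN in the result.
plain = x + y

# With fill_value, a value missing on one side is treated as 0.
# A cell missing in both frames would still come out as NaN.
filled = x.add(y, fill_value=0)

print(plain)
print(filled)
```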


Cumulative methods like cumsum() and cumprod() ignore NA values by default, but preserve them in the resulting arrays. To override this behaviour and include NA values, use skipna=False.




In [153]:
b['one'].sum()


0.9839865215740131

In [154]:
b.cumsum()

Unnamed: 0,one,two
a,,
c,,
e,0.130947,0.949652
f,,
h,0.983987,1.183554


In [155]:
b.cumsum(skipna=False)

Unnamed: 0,one,two
a,,
c,,
e,,
f,,
h,,


We can also use fillna() to fill in NA values with non-NA data in a couple of ways:



In [156]:
b

Unnamed: 0,one,two
a,,
c,,
e,0.130947,0.949652
f,,
h,0.853039,0.233902


In [157]:
b.fillna(0)

Unnamed: 0,one,two
a,0.0,0.0
c,0.0,0.0
e,0.130947,0.949652
f,0.0,0.0
h,0.853039,0.233902


In [158]:
b['one'].fillna('missing')

a     missing
c     missing
e    0.130947
f     missing
h    0.853039
Name: one, dtype: object
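
Besides a constant, missing entries can be filled from neighbouring values or from a statistic, or dropped entirely. This is a minimal sketch on a made-up Series; which strategy is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0], index=['a', 'b', 'c', 'd'])

# Propagate the last valid value forward. The leading NaN stays
# because there is nothing before it to copy.
forward = s.ffill()

# Replace missing entries with the mean of the observed values (2.0).
mean_filled = s.fillna(s.mean())

# Or drop the missing entries altogether.
dropped = s.dropna()

print(forward)
print(mean_filled)
print(dropped)
```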

### **2.2 Merge data**

pandas provides various facilities for combining Series and DataFrame objects, with set logic for the indexes and relational algebra functionality for join / merge-type operations.




In [160]:
df = pd.DataFrame(np.random.randn(10, 4))
df

Unnamed: 0,0,1,2,3
0,0.126684,-0.524689,-1.138657,0.554813
1,-1.550623,-0.269407,-0.127689,1.561491
2,0.554198,0.762847,-0.429282,-0.156099
3,0.649849,-0.763773,-0.21504,-0.228311
4,0.769817,0.968132,-0.542193,0.566636
5,-2.160228,-0.892782,-1.037957,0.373218
6,0.03941,-0.685537,-0.293669,0.458118
7,2.066613,-0.140754,1.082805,-1.722379
8,1.084419,-1.879058,-1.003877,-0.48996
9,1.406796,-1.498176,-0.253376,0.089528


In [162]:
pieces = [df[:2], df[3:5], df[7:]]
pieces

[          0         1         2         3
 0  0.126684 -0.524689 -1.138657  0.554813
 1 -1.550623 -0.269407 -0.127689  1.561491,
           0         1         2         3
 3  0.649849 -0.763773 -0.215040 -0.228311
 4  0.769817  0.968132 -0.542193  0.566636,
           0         1         2         3
 7  2.066613 -0.140754  1.082805 -1.722379
 8  1.084419 -1.879058 -1.003877 -0.489960
 9  1.406796 -1.498176 -0.253376  0.089528]

In [164]:
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,0.126684,-0.524689,-1.138657,0.554813
1,-1.550623,-0.269407,-0.127689,1.561491
3,0.649849,-0.763773,-0.21504,-0.228311
4,0.769817,0.968132,-0.542193,0.566636
7,2.066613,-0.140754,1.082805,-1.722379
8,1.084419,-1.879058,-1.003877,-0.48996
9,1.406796,-1.498176,-0.253376,0.089528


pandas also provides SQL style merges for joining data.

In [0]:
left = pd.DataFrame({'key': ['foo', 'hello'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['hello', 'world'], 'rval': [4, 5]})

In [171]:
left

Unnamed: 0,key,lval
0,foo,1
1,hello,2


In [172]:
right

Unnamed: 0,key,rval
0,hello,4
1,world,5


In [173]:
pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,hello,2,4
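
The merge above keeps only the key present in both frames, which is the default inner join. As a sketch, the how parameter selects other SQL-style joins; the same left and right frames are rebuilt here so the example runs on its own.

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'hello'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['hello', 'world'], 'rval': [4, 5]})

# how='left' keeps every row of left; unmatched rows get NaN for rval.
left_join = pd.merge(left, right, on='key', how='left')

# how='outer' keeps keys from both sides. indicator=True adds a
# _merge column recording which side each row came from.
outer_join = pd.merge(left, right, on='key', how='outer', indicator=True)

print(left_join)
print(outer_join)
```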


### **2.3 Grouping data**


By “**group by**” we are referring to a process involving one or more of the following steps:

**Splitting** the data into groups based on some criteria

**Applying** a function to each group independently

**Combining** the results into a data structure

In [174]:
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
df

Unnamed: 0,A,B,C,D
0,foo,one,-0.2509,-2.07219
1,bar,one,-0.164139,2.612513
2,foo,two,-0.045248,0.818408
3,bar,three,1.276875,-0.651911
4,foo,two,-0.491221,-1.353502
5,bar,two,-1.000299,1.105674
6,foo,one,0.729921,0.750072
7,foo,three,-0.549949,1.004447


In [175]:
df.groupby('A').sum()

Unnamed: 0_level_0,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1
bar,0.112438,3.066277
foo,-0.607397,-0.852765


In [176]:
df.groupby(['A', 'B']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-0.164139,2.612513
bar,three,1.276875,-0.651911
bar,two,-1.000299,1.105674
foo,one,0.479021,-1.322118
foo,three,-0.549949,1.004447
foo,two,-0.536469,-0.535094
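
To make the three steps concrete, the sketch below performs split, apply, and combine by hand on a small made-up frame and compares the result with the one-call form. The agg() call at the end applies several functions at once.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# Split: iterating over a GroupBy yields (key, sub-frame) pairs.
# Apply: compute one value per group.
# Combine: collect the results into a Series keyed by group.
manual = pd.Series({key: group['C'].sum()
                    for key, group in df.groupby('A')})

# groupby(...).sum() performs the same three steps in one call.
builtin = df.groupby('A')['C'].sum()

# agg() applies several functions per group.
summary = df.groupby('A')['C'].agg(['sum', 'mean', 'count'])

print(manual)
print(summary)
```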


### **2.4 Pivot tables**

The function pivot_table() can be used to create spreadsheet-style pivot tables. 

While pivot() provides general purpose pivoting with various data types (strings, numerics, etc.), pandas also provides pivot_table() for pivoting with aggregation of numeric data.



In [177]:
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                   'B': ['A', 'B', 'C'] * 4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})
df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,-1.31611,0.474052
1,one,B,foo,-0.414337,0.153004
2,two,C,foo,0.404791,-0.974483
3,three,A,bar,0.135615,-0.381174
4,one,B,bar,0.328868,0.312974
5,one,C,bar,1.01991,-0.193826
6,two,A,foo,-0.361396,-0.051033
7,three,B,foo,-0.063427,1.153963
8,one,C,foo,0.734983,-0.089854
9,one,A,bar,-0.308505,-1.146276


In [178]:
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,-0.308505,-1.31611
one,B,0.328868,-0.414337
one,C,1.01991,0.734983
three,A,0.135615,
three,B,,-0.063427
three,C,1.718443,
two,A,,-0.361396
two,B,-0.146381,
two,C,,0.404791
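
The difference between the two functions shows up when an index/column pair occurs more than once. In the made-up frame below, ('one', 'foo') appears twice, so pivot() cannot reshape it, while pivot_table() aggregates the duplicates (with the mean by default, or with whatever aggfunc names).

```python
import pandas as pd

df = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'one'],
                   'C': ['foo', 'bar', 'foo', 'bar', 'foo'],
                   'D': [1.0, 2.0, 3.0, 4.0, 5.0]})

# pivot() refuses duplicate index/column pairs.
try:
    df.pivot(index='A', columns='C', values='D')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() averages the two ('one', 'foo') values by default.
mean_table = pd.pivot_table(df, values='D', index='A', columns='C')

# aggfunc selects a different aggregation.
sum_table = pd.pivot_table(df, values='D', index='A', columns='C',
                           aggfunc='sum')

print(mean_table)
print(sum_table)
```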


## **3. Read/Write file**

*The Series and DataFrame objects have an instance method to_csv, which stores the contents of the object as a comma-separated-values file.*

### **3.1 JSON**

Read and write JSON format files and strings.




In [180]:
dfj = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
dfj

Unnamed: 0,A,B
0,0.282548,-0.482054
1,-1.085391,-1.058356
2,-1.54764,-0.193817
3,-0.321072,-0.273076
4,0.908759,-0.745023


In [181]:
json = dfj.to_json()
json

'{"A":{"0":0.282548238,"1":-1.085390589,"2":-1.5476404124,"3":-0.3210719647,"4":0.9087590384},"B":{"0":-0.4820535764,"1":-1.0583564838,"2":-0.193817005,"3":-0.2730759937,"4":-0.745022721}}'

Reading a JSON string into a pandas object can take a number of parameters. The parser will try to parse a DataFrame if typ is not supplied or is None. To explicitly force Series parsing, pass typ='series'.



In [182]:
pd.read_json(json)

Unnamed: 0,A,B
0,0.282548,-0.482054
1,-1.085391,-1.058356
2,-1.54764,-0.193817
3,-0.321072,-0.273076
4,0.908759,-0.745023
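
Here is a minimal sketch of a Series round trip with typ='series', using made-up values. The string is wrapped in StringIO because recent pandas versions warn when a literal JSON string is passed to read_json directly.

```python
from io import StringIO

import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])

# Serialize to a JSON string keyed by index label.
text = s.to_json()

# Parse it back, forcing a Series rather than a DataFrame.
restored = pd.read_json(StringIO(text), typ='series')

print(text)
print(restored)
```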


### **3.2 CSV & text files**

Read and write CSV and other delimited text files.




In [186]:
csv_file='/content/drive/My Drive/Colab Notebooks/ml-latest-small/tags.csv'
tags=pd.read_csv(csv_file)
print(tags)

      userId  movieId               tag   timestamp
0          2    60756             funny  1445714994
1          2    60756   Highly quotable  1445714996
2          2    60756      will ferrell  1445714992
3          2    89774      Boxing story  1445715207
4          2    89774               MMA  1445715200
...      ...      ...               ...         ...
3678     606     7382         for katie  1171234019
3679     606     7936           austere  1173392334
3680     610     3265            gun fu  1493843984
3681     610     3265  heroic bloodshed  1493843978
3682     610   168248  Heroic Bloodshed  1493844270

[3683 rows x 4 columns]


In [0]:
dfj.to_csv('foo.csv')

A 'foo.csv' file will be created in the current working directory.
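
As a sketch of a full round trip, the example below writes a small made-up frame to a temporary directory and reads it back. The file name is hypothetical. By default to_csv also writes the index, so index_col=0 is needed to restore it on reading.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame({'A': [1, 2], 'B': [3.5, 4.5]}, index=['x', 'y'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.csv')  # hypothetical file name

    # The index is written as the first column, with an empty header.
    frame.to_csv(path)

    # index_col=0 turns that first column back into the index.
    restored = pd.read_csv(path, index_col=0)

    # Without index_col, the labels come back as an ordinary column
    # named 'Unnamed: 0'.
    naive = pd.read_csv(path)

print(restored)
print(naive.columns.tolist())
```

Passing index=False to to_csv is the alternative when the index carries no information worth saving.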