# Data Manipulation with Pandas

In [2]:
'''
Pandas is a package built on top of NumPy that provides an efficient implementation of a
 DataFrame. DataFrames are multi-dimensional arrays with attached row and column labels, often
 with heterogeneous types and/or missing data.
 Pandas, and its Series and DataFrame objects, builds on the NumPy array structure and provides
 efficient access to the data cleaning tasks that occupy much of a data scientist’s time.

 Pandas Objects
 There are 3 fundamental Pandas data structures:
 • Series
 • DataFrame
 • Index

'''

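As a quick orientation, the three objects listed above can be sketched as follows. This is a minimal illustration, not part of the original notebook; the names `temps`, `frame`, and `labels` and the values in them are made up for the example.

```python
import numpy as np
import pandas as pd

# A Series: one-dimensional values plus an index of labels.
temps = pd.Series([21.5, 19.0, 23.1], index=["mon", "tue", "wed"])

# A DataFrame: columns of possibly different types sharing one row index.
frame = pd.DataFrame(
    {"temp_c": [21.5, 19.0, 23.1], "rainy": [False, True, False]},
    index=["mon", "tue", "wed"],
)

# An Index: the label object that both structures carry.
labels = frame.index

print(type(temps).__name__, type(frame).__name__, type(labels).__name__)
# The underlying values of a Series are a NumPy array.
print(isinstance(temps.to_numpy(), np.ndarray))
```

Each column of `frame` keeps its own dtype, which is what "heterogeneous types" refers to in the notes above.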

## Pandas Series Object

In [4]:
import numpy as np
data = np.array([0.25, 0.5, 0.75, 1.0])
data

array([0.25, 0.5 , 0.75, 1.  ])

In [5]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [6]:
type(data)

pandas.core.series.Series

In [7]:
data.values # accessing the values of the Series as a NumPy array

array([0.25, 0.5 , 0.75, 1.  ])

In [8]:
'''
 Series as a generalized NumPy array
 While a Series looks much like a regular NumPy array, the essential difference is that
 • a NumPy array has an implicitly defined integer index
 • a Pandas Series has an explicitly defined index associated with the values
 The explicit index gives the Series object additional capabilities. For instance, the index does not
 need to be an integer, but can consist of values of any data type.
 For example, we can use strings as the index of a Series.
 '''


In [9]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['1', '2', '3', '4'])
data

1    0.25
2    0.50
3    0.75
4    1.00
dtype: float64

In [10]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [11]:
data['a']  # accessing value using string index

np.float64(0.25)

In [12]:
'''
 Series as specialized dictionary
 Since you can use any data type as index, Pandas Series can be thought of as a specialized dictionary.
 Let’s construct a Series object directly from a Python dictionary.
 '''


In [13]:
population_dict = {'California': 38,
                   'Texes': 26,
                   'New York': 20,
                   'Florida': 19,
                   'Illinois': 13
                  }         # population in million
population = pd.Series(population_dict)
population

California    38
Texes         26
New York      20
Florida       19
Illinois      13
dtype: int64

In [14]:
print(population['Illinois'])
# accessing value using index

13


In [15]:
population['California': 'New York']

California    38
Texes         26
New York      20
dtype: int64
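Note that the label-based slice above includes its end label. A small sketch of the difference between label slices and positional slices, using a made-up Series `s` (not from the notebook):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# Slicing by label includes the end label 'c'.
by_label = s.loc["a":"c"]

# Slicing by position follows Python rules and stops before position 2.
by_position = s.iloc[0:2]

print(by_label)
print(by_position)
```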

In [16]:
# constructing Series objects

pd.Series([1,2,3])

0    1
1    2
2    3
dtype: int64

In [17]:
pd.Series(7, index=[10,20,30,30]) ## note that duplicate indices are allowed
# data can be a scalar, which is repeated to fill the specified index

10    7
20    7
30    7
30    7
dtype: int64

In [18]:
# data can be a dictionary, in which case the index defaults to the dictionary keys
pd.Series({2:'a', 5:'b', 3:'c'})

2    a
5    b
3    c
dtype: object

In [19]:
# the index can be set explicitly if a different selection or order is preferred
pd.Series({2:'a', 5:'b', 3:'c'}, index=[3, 2])
# the key 5 is dropped because it is not in the index list

3    c
2    a
dtype: object
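The reverse case is also worth seeing: a label requested in `index` that is not a key of the dictionary is kept and filled with NaN. A minimal sketch, where the extra label `7` is an assumption added for illustration:

```python
import pandas as pd

letters = {2: "a", 5: "b", 3: "c"}

# 3 and 2 are dictionary keys; 7 is not, so its value is missing (NaN).
subset = pd.Series(letters, index=[3, 2, 7])

print(subset)
print(subset.isna())
```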

## Pandas DataFrame Objects

In [21]:
'''
 Pandas DataFrame Object
 The DataFrame can be considered as
 • a generalization of a NumPy array, or
 • a specialization of a Python dictionary
 DataFrame as a generalized NumPy array
 Just as a Series is an analog of a one-dimensional array with flexible indices,
 a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible
 column indices.
 '''


In [22]:
area_dict = {'California' : 423,
             'Texas': 695,
             'New York': 141,
             'Florida': 170,
             'Illinois': 150
            }
# area in thousand sq. miles

In [23]:
area = pd.Series(area_dict)
area


California    423
Texas         695
New York      141
Florida       170
Illinois      150
dtype: int64

In [24]:
states = pd.DataFrame({'population': population,
                       'area': area
                      })
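print(states)
# We can combine the area and population Series defined above into a single two-dimensional object.
# Note: population uses the key 'Texes' while area uses 'Texas', so these two labels do not align
# and each of the two rows gets NaN in the column where its label is missing.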

            population   area
California        38.0  423.0
Florida           19.0  170.0
Illinois          13.0  150.0
New York          20.0  141.0
Texas              NaN  695.0
Texes             26.0    NaN
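The NaN values above come from the mismatched labels 'Texes' and 'Texas'. One possible way to spot and repair such a mismatch is sketched below on shortened copies of the two Series; the use of `symmetric_difference` and `rename` here is a suggestion, not something the original notebook does.

```python
import pandas as pd

population = pd.Series({"California": 38, "Texes": 26, "New York": 20})
area = pd.Series({"California": 423, "Texas": 695, "New York": 141})

# Labels that appear in only one of the two Series.
mismatched = population.index.symmetric_difference(area.index)
print(mismatched)

# Rename the misspelled label, then build the DataFrame again.
fixed = population.rename(index={"Texes": "Texas"})
states = pd.DataFrame({"population": fixed, "area": area})
print(states)
```

With matching labels no NaN is introduced, so both columns keep their integer dtype.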


In [25]:
type(states)

pandas.core.frame.DataFrame

In [26]:
states.index
# shows the row labels (index) of the DataFrame

Index(['California', 'Florida', 'Illinois', 'New York', 'Texas', 'Texes'], dtype='object')

In [27]:
states.columns
# shows the columns of the DataFrame

Index(['population', 'area'], dtype='object')

In [28]:
states['area']

California    423.0
Florida       170.0
Illinois      150.0
New York      141.0
Texas         695.0
Texes           NaN
Name: area, dtype: float64

In [29]:
population

California    38
Texes         26
New York      20
Florida       19
Illinois      13
dtype: int64

In [30]:
pd.DataFrame(population, columns = ['population'])
 # single column DataFrame can be created from a single Series

            population
California          38
Texes               26
New York            20
Florida             19
Illinois            13


In [31]:
# from a list of dicts
data = [{'a':i, 'b':2*i} for i in range(10)]
data

[{'a': 0, 'b': 0},
 {'a': 1, 'b': 2},
 {'a': 2, 'b': 4},
 {'a': 3, 'b': 6},
 {'a': 4, 'b': 8},
 {'a': 5, 'b': 10},
 {'a': 6, 'b': 12},
 {'a': 7, 'b': 14},
 {'a': 8, 'b': 16},
 {'a': 9, 'b': 18}]

In [32]:
# Even if some keys in the dicts are missing, pandas fills the gaps with NaN (Not a Number)
pd.DataFrame([{'a':2, 'b':5}, {'b':7, 'c':1}])

     a  b    c
0  2.0  5  NaN
1  NaN  7  1.0
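In the output above, columns `a` and `c` show floats while `b` shows integers. A short sketch of why: NaN is a floating-point value, so an integer column that receives a NaN is stored as float64 by default. The variable names below are chosen for the example.

```python
import pandas as pd

df = pd.DataFrame([{"a": 2, "b": 5}, {"b": 7, "c": 1}])

# Columns with a missing entry become float64; 'b' has none and stays integer.
print(df.dtypes)

# Count the missing values per column.
missing = df.isna().sum()
print(missing)

# One option: replace NaN with 0 and convert back to integers.
filled = df.fillna(0).astype(int)
print(filled)
```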


In [33]:
# from a dict of Series objects
pd.DataFrame({'population': population,
              'area': area
             })

            population   area
California        38.0  423.0
Florida           19.0  170.0
Illinois          13.0  150.0
New York          20.0  141.0
Texas              NaN  695.0
Texes             26.0    NaN


In [34]:
# from a 2-dimensional numpy array
import numpy as np
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

        foo       bar
a  0.284313  0.447049
b  0.770383  0.214399
c  0.345266  0.161567


In [35]:
# from a numpy structured array
pd.DataFrame(np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')]))

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


## Pandas Index Object

In [37]:
# The Pandas Index object can be thought of as an immutable array.
# Let’s construct an Index object from a list of integers.
import pandas as pd
ind = pd.Index([1, 5, 3, 9, 7])
ind

Index([1, 5, 3, 9, 7], dtype='int64')

In [38]:
# can use indexing notation to get values
ind[3]

np.int64(9)

In [39]:
ind[::-1] # slicing also works

Index([7, 9, 3, 5, 1], dtype='int64')
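The "immutable" part can be checked directly: reading from an Index works like reading from an array, but item assignment raises a `TypeError`. A minimal sketch:

```python
import pandas as pd

ind = pd.Index([1, 5, 3, 9, 7])

# Array-like attributes are available.
print(ind.size, ind.shape, ind.ndim, ind.dtype)

# Item assignment is rejected because an Index cannot be modified in place.
raised = False
try:
    ind[1] = 0
except TypeError as exc:
    raised = True
    print("TypeError:", exc)
```

Immutability makes it safe to share one Index between several Series and DataFrames.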

In [40]:
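indA = pd.Index([1, 3, 8, 4, 9])
indB = pd.Index([2, 3, 7, 5, 9])

# In recent pandas versions, & on two Index objects is an element-wise bitwise AND,
# not a set intersection (older versions treated it as an intersection).
indA & indB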

Index([0, 3, 0, 4, 9], dtype='int64')

In [41]:
indA.intersection(indB)

Index([3, 9], dtype='int64')

In [42]:
indA | indB
# element-wise bitwise OR in recent pandas versions; use indA.union(indB) for the set union

Index([3, 3, 15, 5, 9], dtype='int64')

In [43]:
indA.union(indB)

Index([1, 2, 3, 4, 5, 7, 8, 9], dtype='int64')
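Besides `intersection` and `union`, the Index object also offers `difference` and `symmetric_difference`. A short sketch with the same two indices; the result variable names are chosen for the example.

```python
import pandas as pd

indA = pd.Index([1, 3, 8, 4, 9])
indB = pd.Index([2, 3, 7, 5, 9])

# Labels that are in indA but not in indB.
only_in_a = indA.difference(indB)

# Labels that are in exactly one of the two indices.
in_exactly_one = indA.symmetric_difference(indB)

print(only_in_a)
print(in_exactly_one)
```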

In [44]:
'''
 Data Selection in Series
 The Series object is in many ways like a one-dimensional NumPy Array, and in many ways like a
 standard Python dictionary.
 Series as a Dictionary
 Just like a dictionary, the Series object provides a mapping from a collection of keys to a collection
 of values
 '''


In [45]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [46]:
'd' in data

True

In [47]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [48]:
list(data.items())
# get all items

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [49]:
data['e'] = 1.25
data
 # new key (index) and value pair added

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [50]:
data['d'] = 1.50
data
 #  key (index) and value pair updated

a    0.25
b    0.50
c    0.75
d    1.50
e    1.25
dtype: float64

In [51]:
'''
 Indexers: loc and iloc
 Pandas provides special indexer attributes that explicitly expose certain indexing schemes.
 The loc attribute allows indexing and slicing that always references the explicit index.
 The iloc attribute allows indexing and slicing that always references the implicit, zero-based position.
 '''


In [52]:
data = pd.Series(['x','y','z'], index = [1,3,5])
data

1    x
3    y
5    z
dtype: object

In [53]:
data[1]

'x'

In [54]:
data.iloc[1]
# referencing by implicit index using iloc

'y'

In [55]:
data.loc[1]
# referencing by explicit index using loc

'x'

In [56]:
# without loc or iloc
data[1] # an integer key is looked up in the explicit index here

'x'

In [57]:
#  Explicit is always better than implicit. So it is better to use loc and iloc to make explicit which indexing is intended.
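The same distinction applies to slices. A minimal sketch with the integer-indexed Series used above, showing that `loc` slices by label and includes the end label, while `iloc` slices by position and excludes the end position:

```python
import pandas as pd

data = pd.Series(["x", "y", "z"], index=[1, 3, 5])

# Labels 1 through 3, end label included.
by_label = data.loc[1:3]

# Positions 1 and 2, end position excluded.
by_position = data.iloc[1:3]

print(by_label)
print(by_position)
```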

In [58]:
'''
 Data Selection in DataFrame
 A DataFrame acts in many ways as:
 • a two-dimensional or structured array
 • or, a dictionary of Series sharing the same index
 DataFrame as a dictionary
 Let’s see how a DataFrame is analogous to a dictionary.
 '''


In [59]:
area = pd.Series({'California': 424,
                  'Texas': 696,
                  'New York': 141,
                  'Florida': 170,
                  'Illinois': 150
                 }) # thousand sq miles

In [60]:
population = pd.Series({'California': 38,
                        'Texas': 26,
                        'New York': 19,
                        'Florida': 20,
                        'Illinois': 129
                       }) # million

In [61]:
data = pd.DataFrame({'area': area,
                     'population': population
                    })
# data frame as dictionary
data


            area  population
California   424          38
Texas        696          26
New York     141          19
Florida      170          20
Illinois     150         129


In [62]:
# adding a new column 'density'
data['density'] = data['population'] / data['area']
data

            area  population   density
California   424          38  0.089623
Texas        696          26  0.037356
New York     141          19  0.134752
Florida      170          20  0.117647
Illinois     150         129  0.860000


In [63]:
# DataFrame as a 2-dimensional array
# The values of a DataFrame can also be viewed as a two-dimensional NumPy array
data.values

array([[4.24000000e+02, 3.80000000e+01, 8.96226415e-02],
       [6.96000000e+02, 2.60000000e+01, 3.73563218e-02],
       [1.41000000e+02, 1.90000000e+01, 1.34751773e-01],
       [1.70000000e+02, 2.00000000e+01, 1.17647059e-01],
       [1.50000000e+02, 1.29000000e+02, 8.60000000e-01]])

In [64]:
data.T
# transposing the DataFrame (rows become columns, and columns become rows)

            California       Texas    New York     Florida  Illinois
area        424.000000  696.000000  141.000000  170.000000    150.00
population   38.000000   26.000000   19.000000   20.000000    129.00
density       0.089623    0.037356    0.134752    0.117647      0.86


In [65]:
data.values[1]
# returns the row at position 1 of the array, i.e. the second row ('Texas')

array([6.96000000e+02, 2.60000000e+01, 3.73563218e-02])

In [66]:
data

            area  population   density
California   424          38  0.089623
Texas        696          26  0.037356
New York     141          19  0.134752
Florida      170          20  0.117647
Illinois     150         129  0.860000


In [67]:
# implicit indexing
data.iloc[:3, :2] # rows 0, 1, 2 and columns 0, 1

            area  population
California   424          38
Texas        696          26
New York     141          19


In [68]:
# explicit indexing
data.loc[:'New York', :'population']
# rows up to and including 'New York', columns up to and including 'population'

            area  population
California   424          38
Texas        696          26
New York     141          19


In [69]:
# explicit indexing
data.loc[:'Florida', :'area']
# rows up to and including 'Florida', columns up to and including 'area'

            area
California   424
Texas        696
New York     141
Florida      170


In [70]:
# select rows with density > 0.1 (masking) and the columns 'population' and 'density' (fancy indexing)
data.loc[data.density > 0.1, ['population', 'density']]

          population   density
New York          19  0.134752
Florida           20  0.117647
Illinois         129  0.860000


In [71]:
data

            area  population   density
California   424          38  0.089623
Texas        696          26  0.037356
New York     141          19  0.134752
Florida      170          20  0.117647
Illinois     150         129  0.860000


In [72]:
data.iloc[0,2] = 0.99 # update row 0, col 2 to a new value
data

            area  population   density
California   424          38  0.990000
Texas        696          26  0.037356
New York     141          19  0.134752
Florida      170          20  0.117647
Illinois     150         129  0.860000


In [73]:
# data.area and data['area'] return the same output
data['area']
# indexing refers to columns

California    424
Texas         696
New York      141
Florida       170
Illinois      150
Name: area, dtype: int64

In [74]:
# slicing refers to rows
data['Texas':'Florida']

          area  population   density
Texas      696          26  0.037356
New York   141          19  0.134752
Florida    170          20  0.117647


In [75]:
data[data.area < 200]

          area  population   density
New York   141          19  0.134752
Florida    170          20  0.117647
Illinois   150         129  0.860000
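Masks like the one above can be combined. The sketch below rebuilds the same table and selects rows that satisfy two conditions at once; the threshold 0.12 is an arbitrary value picked for the example, and `query` is shown as an alternative spelling of the same filter.

```python
import pandas as pd

data = pd.DataFrame(
    {"area": [424, 696, 141, 170, 150], "population": [38, 26, 19, 20, 129]},
    index=["California", "Texas", "New York", "Florida", "Illinois"],
)
data["density"] = data["population"] / data["area"]

# Combine boolean masks with & (and) or | (or); each condition needs parentheses.
small_and_dense = data[(data["area"] < 200) & (data["density"] > 0.12)]

# The same filter written as a query string.
same_rows = data.query("area < 200 and density > 0.12")

print(small_and_dense)
```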


## Operating on Data in Pandas

In [77]:
'''
As you saw, NumPy can perform quick element-wise operations, from basic arithmetic
 to more sophisticated operations such as trigonometric functions, exponentials,
 and logarithms. Pandas inherits much of this functionality from NumPy.
 In Pandas, for unary operations like negation and trigonometric functions, the ufuncs preserve
 index and column labels in the output. For binary operations, such as addition and multiplication,
 Pandas automatically aligns indices when passing the objects to the ufuncs.
 This means that keeping the context of data and combining data from different sources, both
 potentially error-prone tasks with NumPy arrays, become essentially fool-proof with Pandas.
 '''

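The index alignment mentioned above for binary operations can be sketched with two small Series whose indices only partly overlap. The Series `a` and `b` are made up for this illustration.

```python
import pandas as pd

a = pd.Series([2, 4, 6], index=[0, 1, 2])
b = pd.Series([1, 3, 5], index=[1, 2, 3])

# Values are matched by label, not by position.
# Labels present in only one operand give NaN in the result.
total = a + b
print(total)

# The add method accepts fill_value to use for labels missing from one side.
filled = a.add(b, fill_value=0)
print(filled)
```

The result index is the union of the two input indices.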

In [78]:
# Ufuncs: Index Preservation
# Any NumPy ufunc works on Pandas Series and DataFrame objects.

import numpy as np
import pandas as pd

# create a random number generator with a fixed seed of 42
rng = np.random.RandomState(42)
# generate 4 random integers between 0 (inclusive) and 10 (exclusive)
ser = pd.Series(rng.randint(0, 10, 4)) # a Series object
ser

0    6
1    3
2    7
3    4
dtype: int32

In [79]:
# 'numpy' is used for numerical operations and array manipulation.
# 'pandas' is used for data manipulation and analysis, particularly with tabular data.


In [80]:
# a DataFrame object: random integers between 0 (inclusive) and 10 (exclusive),
# arranged in an array of shape (3, 4), i.e. 3 rows and 4 columns
df = pd.DataFrame(rng.randint(0, 10, (3, 4)), columns=['A', 'B', 'C', 'D'])
df

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4


In [81]:
np.exp(ser)
# exponential (e^element) of each element in the series

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

In [82]:
# a slightly more complex calculation
np.sin(df * np.pi/4)
# sine of each element multiplied by pi/4

          A             B         C             D
0 -1.000000  7.071068e-01  1.000000 -1.000000e+00
1 -0.707107  1.224647e-16  0.707107 -7.071068e-01
2 -0.707107  1.000000e+00 -0.707107  1.224647e-16
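To confirm that a ufunc keeps the labels, the result can be compared with the input. In this sketch the row labels `r1`, `r2`, `r3` are an assumption added so that the preserved index is easy to see; the notebook's own `df` uses the default integer index.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(42)
df = pd.DataFrame(
    rng.randint(0, 10, (3, 4)),
    columns=["A", "B", "C", "D"],
    index=["r1", "r2", "r3"],
)

result = np.sin(df * np.pi / 4)

# The result is still a DataFrame with the same row and column labels.
print(type(result).__name__)
print(result.index.equals(df.index), result.columns.equals(df.columns))

# The numbers match what NumPy computes on the bare array.
print(np.allclose(result.to_numpy(), np.sin(df.to_numpy() * np.pi / 4)))
```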


In [163]:
df

   A  B  C  D
0  6  9  2  6
1  7  4  3  7
2  7  2  5  4
