# Introduction to pandas Data Structures

## Series

**A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.**

In [1]:
import pandas as pd
import numpy as np

In [2]:
obj = pd.Series([1,-2,0,-5])
#obj = pd.Series([1,-2,0,-5],index = ['a','b','c','d']) #with indexing

In [3]:
obj.values,obj.index

(array([ 1, -2,  0, -5], dtype=int64), RangeIndex(start=0, stop=4, step=1))

In [4]:
obj

0    1
1   -2
2    0
3   -5
dtype: int64

In [5]:
obj[0]

1

In [6]:
obj.index = ['a','b','c','d']

In [7]:
obj.values

array([ 1, -2,  0, -5], dtype=int64)

In [8]:
obj.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [9]:
obj

a    1
b   -2
c    0
d   -5
dtype: int64

In [10]:
obj['a'],obj.a

(1, 1)

In [11]:
obj['a':'c']

a    1
b   -2
c    0
dtype: int64

Using NumPy functions or NumPy-like operations, such as filtering with a boolean
array, scalar multiplication, or applying math functions, will preserve the index-value link.

In [12]:
obj.values

array([ 1, -2,  0, -5], dtype=int64)

In [13]:
obj2 = pd.Series([1,2,-2,0],index=['c','a','b','d'])

In [14]:
obj[ obj2 > 0 ]

a    1
c    0
dtype: int64

The boolean mask is aligned to `obj` by index label, so both Series need the same set of labels, but the labels do not have to be in the same order.
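
As a small sketch of this label alignment (the names `values`, `other`, and `mask` are illustrative and not part of the notebook), the mask below comes from a Series whose labels are in a different order, and the selection still follows the labels rather than the positions:

```python
import pandas as pd

values = pd.Series([1, -2, 0, -5], index=['a', 'b', 'c', 'd'])
other = pd.Series([1, 2, -2, 0], index=['c', 'a', 'b', 'd'])

# The mask is True for labels 'c' and 'a' (the labels where other > 0)
mask = other > 0

# pandas lines the mask up with values by label before filtering
selected = values[mask]
print(selected)
```

If the mask were applied by position instead, the rows 'a' and 'b' would have been picked, which is not what happens here.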

In [15]:
obj2 * 2

c    2
a    4
b   -4
d    0
dtype: int64

In [16]:
'd' in obj2

True

In [17]:
np.exp(obj2)

c    2.718282
a    7.389056
b    0.135335
d    1.000000
dtype: float64

- **A Python dictionary can be passed to the Series constructor**

In [18]:
dic = {'Name':'Towhid','ID':17101135,'Roll':135}

In [19]:
Sdata1 = pd.Series(dic)
Sdata1

Name      Towhid
ID      17101135
Roll         135
dtype: object

In [20]:
Sdata1.Name,Sdata1['ID']

('Towhid', 17101135)

- When a dict is passed to Series, its keys become the index, in the dict's insertion order as the output above shows (older pandas versions sorted them). Passing `index` together with a dict does not rename anything: it selects and orders the entries by label, and any label that is not a key of the dict gets NaN. To replace the labels instead, assign to `.index` as in In [6].
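
A minimal sketch of the difference between selecting and relabelling, assuming the same dict as above (the names `record`, `selected`, and `renamed` are illustrative):

```python
import pandas as pd

record = {'Name': 'Towhid', 'ID': 17101135, 'Roll': 135}

# index= with a dict selects by key; labels that are not keys become NaN
selected = pd.Series(record, index=['firstName', 'Reg', 'Roll'])

# rename() maps old labels to new ones and keeps the values
renamed = pd.Series(record).rename({'Name': 'firstName', 'ID': 'Reg'})

print(selected)
print(renamed)
```

`rename` is one option for relabelling; assigning a full list to `.index` works as well when every label changes.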

In [21]:
states = ['firstName','Reg','Roll']
Sdata = pd.Series(dic,index=states)
Sdata

firstName    NaN
Reg          NaN
Roll         135
dtype: object

In [22]:
Sdata.isnull().value_counts()

True     2
False    1
dtype: int64

- **It automatically aligns by index label in arithmetic operations.** Only the label 'Roll' is present in both Series, so only its values are added; every other label gets NaN.

In [23]:
Sdata1 + Sdata

ID           NaN
Name         NaN
Reg          NaN
Roll         270
firstName    NaN
dtype: object
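
The same alignment can be seen more clearly with numeric data. The sketch below uses two made-up Series (`left` and `right` are illustrative names) and compares the `+` operator with the `add` method and its `fill_value` argument:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels present on only one side produce NaN
plain = left + right

# fill_value substitutes 0 for the missing side before adding
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```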

## DataFrame

### DataFrame creation

A common way to construct a DataFrame is from a dictionary of equal-length lists or NumPy arrays.

In [24]:
data = {  'Name': ['Towhid','Hasib','Tamjid','Sadi','Nahian'],'ID':[17101135,17101118,17101152,17101150,17101114],
        'Roll':[135,118,152,150,114],'Year':['4th','4th','4th','4th','4th'],'Sex':['Male','Male','Male','Male','Male']
        }

In [25]:
df1 = pd.DataFrame(data)
df1

Unnamed: 0,Name,ID,Roll,Year,Sex
0,Towhid,17101135,135,4th,Male
1,Hasib,17101118,118,4th,Male
2,Tamjid,17101152,152,4th,Male
3,Sadi,17101150,150,4th,Male
4,Nahian,17101114,114,4th,Male


In [26]:
df = pd.DataFrame( [ ['Towhid',17101135,135,'4th','Male'], ['Hasib',17101118,118,'4th','Male'],
                     ['Tamjid',171011152,152,'4th','Male'], ['Sadi',17101150,150,'4th','Male'],
                     ['Nahian',17101114,114,'4th','Male'],['Sabuj',17101139,139,'4th','Male']
                      ])
df.columns = ['Name','ID','Roll','Year','Sex']
df.index = ['Student1','Student2','Student3','Student4','Student5','Student6']
df

Unnamed: 0,Name,ID,Roll,Year,Sex
Student1,Towhid,17101135,135,4th,Male
Student2,Hasib,17101118,118,4th,Male
Student3,Tamjid,171011152,152,4th,Male
Student4,Sadi,17101150,150,4th,Male
Student5,Nahian,17101114,114,4th,Male
Student6,Sabuj,17101139,139,4th,Male


In [27]:
df.head() #by default it shows first five rows.

Unnamed: 0,Name,ID,Roll,Year,Sex
Student1,Towhid,17101135,135,4th,Male
Student2,Hasib,17101118,118,4th,Male
Student3,Tamjid,171011152,152,4th,Male
Student4,Sadi,17101150,150,4th,Male
Student5,Nahian,17101114,114,4th,Male


In [28]:
df.tail()

Unnamed: 0,Name,ID,Roll,Year,Sex
Student2,Hasib,17101118,118,4th,Male
Student3,Tamjid,171011152,152,4th,Male
Student4,Sadi,17101150,150,4th,Male
Student5,Nahian,17101114,114,4th,Male
Student6,Sabuj,17101139,139,4th,Male


### Accessing (row,col) values

In [29]:
df.Name

Student1    Towhid
Student2     Hasib
Student3    Tamjid
Student4      Sadi
Student5    Nahian
Student6     Sabuj
Name: Name, dtype: object

In [30]:
df['Roll']

Student1    135
Student2    118
Student3    152
Student4    150
Student5    114
Student6    139
Name: Roll, dtype: int64

In [31]:
df[ df.ID >= 17101150 ]

Unnamed: 0,Name,ID,Roll,Year,Sex
Student3,Tamjid,171011152,152,4th,Male
Student4,Sadi,17101150,150,4th,Male


In [32]:
df['Sex'] = 'male'
df.head(2)

Unnamed: 0,Name,ID,Roll,Year,Sex
Student1,Towhid,17101135,135,4th,male
Student2,Hasib,17101118,118,4th,male


### Assigning a NumPy array to a column

In [33]:
df.Roll = np.arange(6)
df

Unnamed: 0,Name,ID,Roll,Year,Sex
Student1,Towhid,17101135,0,4th,male
Student2,Hasib,17101118,1,4th,male
Student3,Tamjid,171011152,2,4th,male
Student4,Sadi,17101150,3,4th,male
Student5,Nahian,17101114,4,4th,male
Student6,Sabuj,17101139,5,4th,male


In [34]:
df[ ['Name','Roll'] ] #selects several columns for every row; use df.loc to also restrict the rows.

Unnamed: 0,Name,Roll
Student1,Towhid,0
Student2,Hasib,1
Student3,Tamjid,2
Student4,Sadi,3
Student5,Nahian,4
Student6,Sabuj,5


### df.loc (value assignment, easy (row, col) access)

In [35]:
df.loc['Student1'][ ['Name','Roll'] ]

Name    Towhid
Roll         0
Name: Student1, dtype: object

In [36]:
df.loc['Student1',['Name','Roll'] ]

Name    Towhid
Roll         0
Name: Student1, dtype: object

In [37]:
#create a new col and assign 1 to the conditional position
df.loc[ df.ID >= 17101150,'Greater' ] = 1 

In [38]:
df

Unnamed: 0,Name,ID,Roll,Year,Sex,Greater
Student1,Towhid,17101135,0,4th,male,
Student2,Hasib,17101118,1,4th,male,
Student3,Tamjid,171011152,2,4th,male,1.0
Student4,Sadi,17101150,3,4th,male,1.0
Student5,Nahian,17101114,4,4th,male,
Student6,Sabuj,17101139,5,4th,male,


In [39]:
df.loc[df.Greater.isnull(),'Greater'] = 0
df

Unnamed: 0,Name,ID,Roll,Year,Sex,Greater
Student1,Towhid,17101135,0,4th,male,0.0
Student2,Hasib,17101118,1,4th,male,0.0
Student3,Tamjid,171011152,2,4th,male,1.0
Student4,Sadi,17101150,3,4th,male,1.0
Student5,Nahian,17101114,4,4th,male,0.0
Student6,Sabuj,17101139,5,4th,male,0.0


### Deleting a column

In [40]:
del df['Greater']
df.head(1)

Unnamed: 0,Name,ID,Roll,Year,Sex
Student1,Towhid,17101135,0,4th,male


### DataFrame from a nested dict

**If we pass a nested dict as the data of a DataFrame, pandas will interpret the outer dict keys
as the columns and the inner keys as the row indices.**

In [41]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [42]:
popdf = pd.DataFrame(pop)
popdf

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [43]:
popdf.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


In [44]:
pdata = {'Ohio': popdf['Ohio'][:-1],'Nevada': popdf['Nevada'][:2]}
df = pd.DataFrame(pdata)
df

Unnamed: 0,Ohio,Nevada
2000,1.5,
2001,1.7,2.4


Ohio and Nevada are state names and 2000 and 2001 are years, so we record that by naming the column axis and the index.

In [45]:
df.index.name = 'year';df.columns.name = 'States'
df

States,Ohio,Nevada
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,1.5,
2001,1.7,2.4


In [46]:
df.values

array([[1.5, nan],
       [1.7, 2.4]])

In [47]:
df.index,df.columns

(Int64Index([2000, 2001], dtype='int64', name='year'),
 Index(['Ohio', 'Nevada'], dtype='object', name='States'))

## Index Objects

In [48]:
obj = pd.Series(range(3),index=['a','b','c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [49]:
index[:2]

Index(['a', 'b'], dtype='object')

In [50]:
#index[0] = 'd' # immutable

In [51]:
obj = pd.Series([1,2,3])
index = pd.Index(np.arange(3))
obj.index = index
obj.index is index

True

In [52]:
'Ohio' in df.columns

True

# Essential Functionality

## Reindexing

### For Series

In [53]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [54]:
obj2 = obj.reindex(['a', 'b', 'c', 'd','e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [55]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [56]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

### For DataFrame

In [57]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)),
                    index=['a', 'c', 'd'],columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [58]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [59]:
obj = frame.reindex(['a', 'c', 'd','e'])
obj

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
c,3.0,4.0,5.0
d,6.0,7.0,8.0
e,,,


In [60]:
obj = obj.reindex(index=['a', 'c', 'd','f'],columns=['Texas', 'Brazil', 'California'])
obj

Unnamed: 0,Texas,Brazil,California
a,1.0,,2.0
c,4.0,,5.0
d,7.0,,8.0
f,,,


In [61]:
#reindexing by loc (deprecated here; pandas 1.0 and later raise KeyError for missing labels, use reindex instead)
obj = frame.loc[ ['a', 'c', 'd','f','g'],['Texas', 'Brazil', 'California','Pexas']  ]

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self._getitem_tuple(key)


In [62]:
obj

Unnamed: 0,Texas,Brazil,California,Pexas
a,1.0,,2.0,
c,4.0,,5.0,
d,7.0,,8.0,
f,,,,
g,,,,
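
Since the warning above recommends `.reindex()`, here is a hedged sketch of that alternative using the same labels. The `try`/`except` part only records what `.loc` does on the installed pandas version; `loc_failed` is an illustrative name.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

rows = ['a', 'c', 'd', 'f', 'g']
cols = ['Texas', 'Brazil', 'California', 'Pexas']

# reindex accepts labels that do not exist yet and fills them with NaN
expanded = frame.reindex(index=rows, columns=cols)
print(expanded)

# .loc with missing labels is expected to raise KeyError on recent pandas
try:
    frame.loc[rows, cols]
    loc_failed = False
except KeyError:
    loc_failed = True
print('loc raised KeyError:', loc_failed)
```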


## Dropping Entries from an Axis

### For Series

**Returns object**

The drop method returns a new object with the indicated value or values deleted from an axis. The original object remains unchanged.

In [63]:
obj = pd.Series(np.arange(5),index=['a', 'b', 'c', 'd', 'e'])
obj

a    0
b    1
c    2
d    3
e    4
dtype: int32

In [64]:
#main obj will remain same
new_obj = obj.drop('b')
new_obj,obj

(a    0
 c    2
 d    3
 e    4
 dtype: int32, a    0
 b    1
 c    2
 d    3
 e    4
 dtype: int32)

In [65]:
new_obj = obj.drop(['b','c'])
new_obj

a    0
d    3
e    4
dtype: int32

**Does not return object**

In [66]:
obj.drop('b',inplace=True)
obj

a    0
c    2
d    3
e    4
dtype: int32

### For DataFrame

With DataFrame, index values can be deleted from either axis.

In [67]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),index=['Ohio', 'Colorado', 'Utah', 'New York']
                    ,columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


#### Drop rows

**Returns object**

In [68]:
#this will return an object but the main object will be unchanged
new_data = data.drop(['Colorado','Utah'])
new_data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
New York,12,13,14,15


**Does not return object**

In [69]:
#this will drop from the main object and does not return an object
data.drop('Utah',inplace=True)
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,12,13,14,15


#### Drop columns

You can drop values from the columns by passing **axis=1 or axis='columns'.**

In [70]:
new_data = data.drop('one',axis='columns')
new_data
#similar code
new_data = data.drop(['one','two'],axis=1)
new_data

Unnamed: 0,three,four
Ohio,2,3
Colorado,6,7
New York,14,15


In [71]:
df = pd.DataFrame(np.arange(1,5).reshape((2,2)))
df

Unnamed: 0,0,1
0,1,2
1,3,4


## Indexing, Selection, and Filtering

### Indexing

Values can be accessed either by the index labels we assigned or by integer position (0, 1, 2, ...).

#### Series

In [72]:
obj = pd.Series(np.arange(4),index=['a', 'b', 'c', 'd'])
obj

a    0
b    1
c    2
d    3
dtype: int32

In [73]:
obj['a'],obj[0]

(0, 0)

In [74]:
obj[[1,2]]

b    1
c    2
dtype: int32

In [75]:
obj[ obj<2 ]

a    0
b    1
dtype: int32

In [76]:
#the end point is inclusive, so the slice includes the value at label 'c'.
obj['a':'c']

a    0
b    1
c    2
dtype: int32
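
To make the inclusive end point concrete, this sketch compares a label slice with a positional slice on the same Series (`by_label` and `by_position` are illustrative names):

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

# Label slice: the end label 'c' is included
by_label = obj.loc['a':'c']

# Positional slice: the stop position 2 is excluded, as in plain Python
by_position = obj.iloc[0:2]

print(by_label)
print(by_position)
```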

#### DataFrame

In [77]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
 index=['Ohio', 'Colorado', 'Utah', 'New York'],
 columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [78]:
#row wise slicing
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [79]:
data[ ['three', 'one'] ]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [80]:
data[ data['three']>5 ]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [81]:
data[data < 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


### Selection with loc and iloc

#### loc

In [82]:
data.loc['Ohio',['one','two']]

one    0
two    0
Name: Ohio, dtype: int32

In [83]:
data.loc[['Ohio','Colorado'],['one','two']]

Unnamed: 0,one,two
Ohio,0,0
Colorado,0,5


In [84]:
data.loc['Ohio':'Utah',['one','two'] ]

Unnamed: 0,one,two
Ohio,0,0
Colorado,0,5
Utah,8,9


In [85]:
data.loc['Ohio':'Utah','one':'three' ]

Unnamed: 0,one,two,three
Ohio,0,0,0
Colorado,0,5,6
Utah,8,9,10


In [86]:
data.loc[['Ohio','Utah'],['one','three' ]]

Unnamed: 0,one,three
Ohio,0,0
Utah,8,10


#### iloc

In [87]:
#instead of the labels we assigned, we can select by integer position
data.iloc[:2,[0,1]]

Unnamed: 0,one,two
Ohio,0,0
Colorado,0,5


In [88]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [89]:
data.iloc[[1, 2], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [90]:
data.iloc[:, :3][data.three > 5]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14


### Integer Indexes

In [91]:
ser = pd.Series(np.arange(3))

In [92]:
ser.iloc[-1],ser.loc[1] #ser[-1] will raise error.

(2, 1)

In [93]:
ser2 = pd.Series(np.arange(3),index=['a','b','c'])

In [94]:
ser2[-1]

2
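
When the index itself holds integers, it is easy to confuse labels with positions. A short sketch, assuming a default integer index, of how `loc` and `iloc` keep the two apart:

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))

last = ser.iloc[-1]          # position-based: the last element
label_slice = ser.loc[:1]    # label-based: the end label 1 is included
pos_slice = ser.iloc[:1]     # position-based: stop position 1 is excluded

print(last, len(label_slice), len(pos_slice))
```

Using `loc` for labels and `iloc` for positions avoids having to remember how plain `[]` behaves for each kind of index.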

### Arithmetic and Data Alignment

In [95]:
ser3 = pd.Series(np.arange(4,7),index=['a','b','d'])

In [96]:
ser2+ser3

a    4.0
b    6.0
c    NaN
d    NaN
dtype: float64

The same alignment applies to DataFrames, on both rows and columns. In the example that follows, the 'c' and 'e' columns are not found in both DataFrame objects, so they appear
as all missing in the result. The same holds for the rows whose labels are not common
to both objects.

In [97]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
index=['Ohio', 'Texas', 'Colorado'])

df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
 index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [98]:
df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [99]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [100]:
df = df1+df2
df

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


### Arithmetic methods with fill values

In [101]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)),
 columns=list('abcd'))

df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)),
 columns=list('abcde'))

In [102]:
df1.add(df2,fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0


In [103]:
1/df1

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [104]:
df1.rdiv(1,fill_value=0)

Unnamed: 0,a,b,c,d
0,inf,1.0,0.5,0.333333
1,0.25,0.2,0.166667,0.142857
2,0.125,0.111111,0.1,0.090909


In [105]:
#note: df2.index (0..3) is passed as the column labels; none of them exist in df1, so every cell is filled with fill_value=0.
df1.reindex(columns=df2.index,fill_value=0)

Unnamed: 0,0,1,2,3
0,0,0,0,0
1,0,0,0,0
2,0,0,0,0
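
If the goal is to give `df1` the same column names as `df2`, the column labels should come from `df2.columns` rather than `df2.index`. A sketch under that assumption (`aligned` is an illustrative name):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Only 'e' is new to df1, so only that column is filled with 0
aligned = df1.reindex(columns=df2.columns, fill_value=0)
print(aligned)
```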


### Operations between DataFrame and Series

In [106]:
arr = np.arange(9).reshape((3,3))
arr

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [107]:
arr[0]

array([0, 1, 2])

In [108]:
arr-arr[0]

array([[0, 0, 0],
       [3, 3, 3],
       [6, 6, 6]])

In [109]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
 columns=list('bde'), index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


By default, arithmetic between DataFrame and Series matches the index of the Series
on the DataFrame’s columns, broadcasting down the rows.

In [110]:
ser = frame.iloc[0]
ser

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [111]:
frame - ser

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


In [112]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2

b    0
e    1
f    2
dtype: int64

In [113]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


If you want to instead broadcast over the columns, matching on the rows, you have to
use one of the arithmetic methods.

In [114]:
series3 = frame.iloc[:,1]
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

The axis number that we pass is the axis to match on. In this case we mean to match
on the DataFrame’s row index (axis='index' or axis=0) and broadcast across.

In [115]:
frame.sub(series3,axis=0)

Unnamed: 0,b,d,e
Utah,-1.0,0.0,1.0
Ohio,-1.0,0.0,1.0
Texas,-1.0,0.0,1.0
Oregon,-1.0,0.0,1.0
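
A typical use of matching on the row index is centering each row around its own mean. The sketch below rebuilds the same frame and uses `sub` with `axis=0` (`row_means` and `centered` are illustrative names):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# One mean per row label
row_means = frame.mean(axis=1)

# Match row_means on the row labels and broadcast across the columns
centered = frame.sub(row_means, axis=0)
print(centered)
```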


### Function Application and Mapping

In [116]:
frame = pd.DataFrame(np.random.randn(4,3),columns=list('bde'),
                    index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,-1.088499,-0.649177,-0.794986
Ohio,-1.171087,-1.031132,0.091789
Texas,-0.210896,1.796893,0.188664
Oregon,1.247805,0.485808,1.442713


In [117]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.088499,0.649177,0.794986
Ohio,1.171087,1.031132,0.091789
Texas,0.210896,1.796893,0.188664
Oregon,1.247805,0.485808,1.442713


#### Applying a function to each row or column

In [118]:
f = lambda x: x.max() - x.min()

In [119]:
frame.apply(f,axis='columns')

Utah      0.439322
Ohio      1.262876
Texas     2.007789
Oregon    0.956905
dtype: float64

In [120]:
frame.iloc[0,:].max() -frame.iloc[0,:].min()

0.43932206259324336

In [121]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)))

In [122]:
sq = lambda x: x ** 2

In [123]:
frame.apply(sq,axis=0)

Unnamed: 0,0,1,2
0,0,1,4
1,9,16,25
2,36,49,64


In [124]:
def add(value):
    return value+5


In [125]:
frame.apply(add)

Unnamed: 0,0,1,2
0,5,6,7
1,8,9,10
2,11,12,13


In [126]:
add = lambda x: x+5

In [127]:
frame.apply(add)

Unnamed: 0,0,1,2
0,5,6,7
1,8,9,10
2,11,12,13


In [128]:
frame = pd.DataFrame(np.random.randn(4,3))
frame

Unnamed: 0,0,1,2
0,0.511199,-1.061617,1.318798
1,-0.222705,0.92189,-1.468974
2,0.75512,0.112218,-0.930269
3,1.47235,-0.555386,2.010496


In [129]:
form = lambda x: '%0.2f' %x

In [130]:
frame.applymap(form)

Unnamed: 0,0,1,2
0,0.51,-1.06,1.32
1,-0.22,0.92,-1.47
2,0.76,0.11,-0.93
3,1.47,-0.56,2.01


In [131]:
frame[0].map(form)

0     0.51
1    -0.22
2     0.76
3     1.47
Name: 0, dtype: object

### Sorting and Ranking

To sort
lexicographically by row or column index, use the sort_index method, which returns
a new, sorted object.

#### Index sorting

In [132]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [133]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [134]:
obj.sort_values()

d    0
a    1
b    2
c    3
dtype: int64

In [135]:
frame = pd.DataFrame([[4,5,10,7],[1,20,8,6]],index=['f', 'e'],
   columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
f,4,5,10,7
e,1,20,8,6


- **Column sorting**

In [136]:
frame.sort_index(axis = 1)

Unnamed: 0,a,b,c,d
f,5,10,7,4
e,20,8,6,1


In [137]:
frame.sort_index(axis=1,ascending=False)

Unnamed: 0,d,c,b,a
f,4,7,10,5
e,1,6,8,20


- **Rows sorting**

In [138]:
frame.sort_index(axis=0)

Unnamed: 0,d,a,b,c
e,1,20,8,6
f,4,5,10,7


In [139]:
frame.sort_index(axis=0,ascending=False)

Unnamed: 0,d,a,b,c
f,4,5,10,7
e,1,20,8,6


#### Value sorting

In [140]:
obj = pd.Series([4,7,1,9,np.nan])

In [141]:
obj.sort_values() #Any missing values are sorted to the end of the Series by default

2    1.0
0    4.0
1    7.0
3    9.0
4    NaN
dtype: float64

In [142]:
frame = pd.DataFrame([  [4,5,10,-49,9],[-30,20,-19,4,2],[-29,-48,13,0,-3] ],columns=['a','d','e','b','c']  )
frame

Unnamed: 0,a,d,e,b,c
0,4,5,10,-49,9
1,-30,20,-19,4,2
2,-29,-48,13,0,-3


In [143]:
frame.sort_values(['a','b','d'])

Unnamed: 0,a,d,e,b,c
1,-30,20,-19,4,2
2,-29,-48,13,0,-3
0,4,5,10,-49,9


In [144]:
frame = frame.sort_index(axis=1)
frame

Unnamed: 0,a,b,c,d,e
0,4,-49,9,5,10
1,-30,4,2,20,-19
2,-29,0,-3,-48,13


In [145]:
frame.sort_values('b')

Unnamed: 0,a,b,c,d,e
0,4,-49,9,5,10
2,-29,0,-3,-48,13
1,-30,4,2,20,-19


In [146]:
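#np.sort sorts along the last axis by default, so this sorts the values within each row
lis = list('abcde')
frame[ lis[:] ] = np.sort(frame[ lis[:] ])
frame
#this is the same operation selected by row label, so the frame does not change again
lis = [0,1,2]
frame.loc[lis[:]] = np.sort(frame.loc[ lis[:] ])
frame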

Unnamed: 0,a,b,c,d,e
0,-49,4,5,9,10
1,-30,-19,2,4,20
2,-48,-29,-3,0,13
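
`np.sort` takes an `axis` argument that decides the direction of the sort. A small sketch on a made-up frame (the data and the names `within_rows` and `within_cols` are illustrative):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[4, -49, 9], [-30, 4, 2]], columns=list('abc'))

# axis=1 sorts the values inside each row (the default, axis=-1, does the same for 2-D data)
within_rows = pd.DataFrame(np.sort(frame.to_numpy(), axis=1),
                           index=frame.index, columns=frame.columns)

# axis=0 sorts the values inside each column
within_cols = pd.DataFrame(np.sort(frame.to_numpy(), axis=0),
                           index=frame.index, columns=frame.columns)

print(within_rows)
print(within_cols)
```

Note that either call breaks the link between a value and its original row or column label, unlike `sort_values`, which reorders whole rows.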


In [147]:
frame.sort_values(['a','b','c'])

Unnamed: 0,a,b,c,d,e
0,-49,4,5,9,10
2,-48,-29,-3,0,13
1,-30,-19,2,4,20


#### Ranking

In [148]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [149]:
np.sort(obj)

array([-5,  0,  2,  4,  4,  7,  7], dtype=int64)

- **How it works?**
rank() ranks in ascending order by default. Writing r(x) for rank(x), each value gets its position in the sorted data:
- r(-5) = 1
- r(0) = 2
- r(2) = 3
- **The two 4s occupy the 4th and 5th positions, so each gets the average of those positions: r(4) = (4 + 5)/2 = 4.5**
- r(4) = 4.5
- r(4) = 4.5
- r(7) = (6 + 7)/2 = 6.5
- r(7) = 6.5


- **By default method='average': each entry in a group of equal values is assigned the group's average rank**

In [150]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
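
The averaging rule can be reproduced by hand. This is only a sketch of the idea, not how pandas implements `rank` internally; the names `ordered`, `positions`, and `manual_rank` are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Sort the values and number the sorted positions 1..n
ordered = obj.sort_values()
positions = pd.Series(range(1, len(ordered) + 1), index=ordered.index)

# Average the positions occupied by each distinct value
average_position = positions.groupby(ordered).mean()

# Look up the average position for every original entry
manual_rank = obj.map(average_position)

print(manual_rank)
print(obj.rank())
```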

Ranks can also be assigned according to the order in which the values are observed in the
data. Instead of averaging the ranks of tied values, each tied value takes the next rank in order of appearance. For example, the first 7 (at index 0) gets rank 6 and the second 7 (at index 2) gets rank 7.
- **method='first': assign ranks in the order the values appear in the data**

In [151]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [152]:
obj.rank(ascending=False,method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

- **method='max': use the maximum rank for the whole group.**
- With ascending=False, the largest value is ranked first. The two 7s (at index 0 and index 2) occupy ranks 1 and 2, and method='max' gives both of them the larger of the two, so each 7 is ranked 2.
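
To compare the tie-breaking methods side by side, the sketch below ranks the same Series once per method, in ascending order, and collects the results in one DataFrame (`methods` and `ranks` are illustrative names):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method
methods = ['average', 'min', 'max', 'first', 'dense']
ranks = pd.DataFrame({m: obj.rank(method=m) for m in methods})
print(ranks)
```

The columns differ only in the rows that hold the tied values 4 and 7.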

- **DataFrame can compute ranks over the rows or the columns:**

In [153]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
 'c': [-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [154]:
frame.rank(axis='columns')

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


### Axis Indexes with Duplicate Labels

In [155]:
obj = pd.Series(range(5),index = ['a','a','b','b','c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [156]:
obj.index.is_unique

False

In [157]:
obj.a

a    0
a    1
dtype: int64

In [158]:
df = pd.DataFrame(np.random.randn(4,3),index = ['a','a','b','b'])
df

Unnamed: 0,0,1,2
a,-0.773422,0.238033,-1.635951
a,-0.509893,0.71167,1.212345
b,1.285534,1.679694,-0.361329
b,0.803632,1.349758,0.153192


In [159]:
df.loc['b']

Unnamed: 0,0,1,2
b,1.285534,1.679694,-0.361329
b,0.803632,1.349758,0.153192


# Summarizing and Computing Descriptive Statistics

In [160]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
 [np.nan, np.nan], [0.75, -1.3]],
 index=['a', 'b', 'c', 'd'],
 columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [161]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [162]:
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [163]:
df.sum(axis='index')

one    9.25
two   -5.80
dtype: float64

In [164]:
df.sum(axis='columns',skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64

- **Returns index value of the max and min value**

In [165]:
df.idxmax(),df.idxmin()

(one    b
 two    d
 dtype: object, one    d
 two    b
 dtype: object)

In [166]:
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


- **accumulations**

In [167]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


Some methods are neither reductions nor accumulations. describe is one such example, producing multiple summary statistics in one shot.

In [168]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


- **on non-numeric data, describe produces alternative summary statistics**

In [169]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)
obj

0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object

In [170]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

## Correlation and Covariance

In [179]:
import pandas_datareader.data as web

In [180]:
all_data = {ticker: web.get_data_yahoo(ticker)
           for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

In [183]:
price = pd.DataFrame({ticker: data['Adj Close']
for ticker, data in all_data.items()})

volume = pd.DataFrame({ticker: data['Volume']
for ticker, data in all_data.items()})

In [185]:
price.head()

Unnamed: 0_level_0,AAPL,IBM,MSFT,GOOG
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2015-06-16,117.597267,134.24054,41.477261,528.150024
2015-06-17,117.320793,134.506073,41.603962,529.26001
2015-06-18,117.855324,135.375046,42.282726,536.72998
2015-06-19,116.675667,134.361221,41.721615,536.690002
2015-06-22,117.606491,134.95665,41.839264,538.190002


In [187]:
volume.head()

Unnamed: 0_level_0,AAPL,IBM,MSFT,GOOG
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2015-06-16,31494100.0,3249800.0,27070300.0,1071800
2015-06-17,32918100.0,2863000.0,28704100.0,1294200
2015-06-18,35407200.0,3330900.0,32658300.0,1833100
2015-06-19,54716900.0,7074000.0,63837000.0,1893500
2015-06-22,34039300.0,2335800.0,20318100.0,1250300


In [189]:
returns = price.pct_change()
returns.tail()

Unnamed: 0_level_0,AAPL,IBM,MSFT,GOOG
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2020-06-08,0.005912,0.027942,0.006197,0.005715
2020-06-09,0.031578,-0.028582,0.007645,0.006602
2020-06-10,0.025728,-0.015166,0.037092,0.006654
2020-06-11,-0.04801,-0.091322,-0.053698,-0.042303
2020-06-12,0.008634,0.033048,0.007892,0.006653


The corr method of Series computes the correlation of the overlapping, non-NA,
aligned-by-index values in two Series. Relatedly, cov computes the covariance.

In [192]:
returns['MSFT'].corr(returns['IBM'])

0.595572512371279

In [193]:
returns['MSFT'].cov(returns['IBM'])

0.00016576313443183543

In [194]:
returns.MSFT.corr(returns.IBM)

0.595572512371279
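
The relationship between the two methods can be checked without downloading anything. The sketch below uses synthetic random data in place of the stock returns (the data, the seed, and the column names `X` and `Y` are assumptions for illustration) and verifies that the correlation equals the covariance divided by the product of the standard deviations over the overlapping non-NA rows:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the returns table
rng = np.random.default_rng(0)
returns = pd.DataFrame(rng.normal(size=(250, 2)), columns=['X', 'Y'])
returns.iloc[0, 0] = np.nan  # a missing value, like the first row of pct_change

corr = returns['X'].corr(returns['Y'])
cov = returns['X'].cov(returns['Y'])

# Recompute the correlation by hand on the rows where both values exist
valid = returns.dropna()
manual_corr = cov / (valid['X'].std() * valid['Y'].std())

print(corr, manual_corr)
```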

DataFrame’s corr and cov methods, on the other hand, return a full correlation or
covariance matrix as a DataFrame, respectively.

In [195]:
returns.corr()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,1.0,0.52937,0.713068,0.643896
IBM,0.52937,1.0,0.595573,0.527776
MSFT,0.713068,0.595573,1.0,0.751875
GOOG,0.643896,0.527776,0.751875,1.0


In [196]:
returns.cov()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,0.000332,0.000155,0.000225,0.000201
IBM,0.000155,0.000259,0.000166,0.000146
MSFT,0.000225,0.000166,0.000299,0.000223
GOOG,0.000201,0.000146,0.000223,0.000295


The corrwith method can compute pairwise correlations between a DataFrame’s columns or rows and another Series or DataFrame. Passing a Series returns a Series with the correlation value computed for each column.

In [197]:
returns.corrwith(returns.IBM)

AAPL    0.529370
IBM     1.000000
MSFT    0.595573
GOOG    0.527776
dtype: float64

Passing a DataFrame computes the correlations of matching column names. Here I
compute correlations of percent changes with volume.

In [198]:
returns.corrwith(volume)

AAPL   -0.140641
IBM    -0.113600
MSFT   -0.067274
GOOG   -0.039279
dtype: float64

Passing axis='columns' does things row-by-row instead.

In [200]:
returns.corrwith(volume,axis='columns').head()

Date
2015-06-16         NaN
2015-06-17   -0.496502
2015-06-18   -0.055078
2015-06-19   -0.864198
2015-06-22    0.710834
dtype: float64

## Unique Values, Value Counts, and Membership

In [201]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [206]:
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [207]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [208]:
pd.value_counts(obj.values,sort=False)

a    3
d    1
c    3
b    2
dtype: int64

In [210]:
mask = obj.isin(['b', 'c'])
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [211]:
obj[ mask ]

0    c
5    b
6    b
7    c
8    c
dtype: object

In [218]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])

In [219]:
unique_vals = pd.Series(['c', 'b', 'a'])

In [221]:
pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2], dtype=int64)

In some cases, you may want to compute a histogram on multiple related columns in
a DataFrame.

In [223]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
 'Qu2': [2, 3, 1, 2, 3],
 'Qu3': [1, 5, 2, 4, 4]})
data

Unnamed: 0,Qu1,Qu2,Qu3
0,1,2,1
1,3,3,5
2,4,1,2
3,3,2,4
4,4,3,4


Here, the row labels in the result are the distinct values occurring in all of the columns.
The values are the respective counts of these values in each column.

In [228]:
result = data.apply(pd.value_counts).fillna(0)
result

Unnamed: 0,Qu1,Qu2,Qu3
1,1.0,1.0,1.0
2,0.0,2.0,1.0
3,2.0,2.0,0.0
4,2.0,0.0,2.0
5,0.0,0.0,1.0
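
As far as I know, the top-level `pd.value_counts` function is deprecated in recent pandas releases in favour of the `value_counts` method. A sketch of the same table built with the method instead, with the counts cast back to integers (`counts` is an illustrative name):

```python
import pandas as pd

data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

# Count the values of each column, then line the counts up by value
counts = (data.apply(lambda col: col.value_counts())
              .fillna(0)
              .astype(int)
              .sort_index())
print(counts)
```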
