In [1]:
import numpy as np
import pandas as pd

### Note : 
Think of a Series as a Java Map whose values share one data type, e.g. ```Map<String,Integer>, Map<String,String>```,
and whose keys are called the index in pandas.
A DataFrame is a collection of Series (each with its own value data type) that share the same or a similar index.
In the Java analogy, a DataFrame is a group of Maps with the same keys but different value data types; each column in the DataFrame represents one Map (more specifically, the values in that Map).

__Java representation:__

```java
Map<String,Float> population = new HashMap<String, Float>(){
		{
			put("Mumbai",1.84f);
			put("Delhi",1.9f);
			put("Pune",0.31f);
		}
	};
	
	Map<String,Integer> area = new HashMap<String, Integer>(){
		{
			put("Mumbai",603);
			put("Delhi",1484);
			put("Pune",331);
		}
	};
	
	Map<String,String> state = new HashMap<String, String>(){
		{
			put("Mumbai","Maharashtra");
			put("Delhi","Central");
			put("Pune","Maharashtra");
		}
	};
```

__Python representation:__

```python
cities = {
    'population':{'Delhi':1.9,'Pune':0.31,'Mumbai':1.84},
    'area':{'Delhi':1484,'Pune':331,'Mumbai':603},
    'state':{'Delhi':'Central','Pune':'Maharashtra','Mumbai':'Maharashtra'}
}
```

In the MS Excel analogy, a DataFrame is a worksheet, the index is the index column (the left-most bold column ;)), and the remaining columns are Excel columns whose values share one data type.


__Excel representation:__



city|area|population|state
-|-|-|-
Delhi|1484|1.90|Central
Mumbai|603|1.84|Maharashtra
Pune|331|0.31|Maharashtra


__Database representation:__

<pre>
CREATE TABLE CITIES(
city VARCHAR,
area NUMBER,
population NUMBER,
state VARCHAR
);
</pre>
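
As a rough sketch of the analogy above, the nested `cities` dictionary from the Python representation can be passed straight to `pd.DataFrame`. The name `cities_df` is only illustrative. Each outer key becomes a column (a Series with its own dtype) and the inner keys become the shared index.

```python
import pandas as pd

cities = {
    'population': {'Delhi': 1.9, 'Pune': 0.31, 'Mumbai': 1.84},
    'area': {'Delhi': 1484, 'Pune': 331, 'Mumbai': 603},
    'state': {'Delhi': 'Central', 'Pune': 'Maharashtra', 'Mumbai': 'Maharashtra'},
}

# Outer keys -> columns, inner keys -> index
cities_df = pd.DataFrame(cities)

# Each column is a Series with its own value type
print(cities_df.dtypes)
print(type(cities_df['area']))
```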

## DDL

### DataFrame creation

In [2]:
# Naming a series
pune = pd.Series(dict(apple=30,mango=45,banana=67))
pune.name = "Pune"
print(pune)
print(pune.name)

apple     30
banana    67
mango     45
Name: Pune, dtype: int64
Pune


#### Concatenating multiple series

In [3]:

mumbai = pd.Series(dict(apple=30,mango=45,banana=67))
kolkata = pd.Series(dict(apple=32,mango=90,banana=34))
delhi = pd.Series(dict(apple=20,mango=94,banana=45))

# axis should be 1, else the contents will be merged into a single series
fruits_df = pd.concat([mumbai,kolkata,delhi],axis=1)

print(type(fruits_df))

fruits_df

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,0,1,2
apple,30,32,20
banana,67,34,45
mango,45,90,94


In [4]:
# assign column names
fruits_df.columns = ['Mumbai','Kolkata','Delhi']

mumbai.name = "MUMBAI" # This won't take effect because dataFrame has already been created

fruits_df

Unnamed: 0,Mumbai,Kolkata,Delhi
apple,30,32,20
banana,67,34,45
mango,45,90,94


In [5]:
# Another way to assign column names : assign names to series BEFORE concatenation


mumbai.name = "Mumbai"
kolkata.name = "Kolkata"
delhi.name = "Delhi"

pd.concat([mumbai,delhi,kolkata],axis=1)

Unnamed: 0,Mumbai,Delhi,Kolkata
apple,30,20,32
banana,67,45,34
mango,45,94,90
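
A third option, sketched here under the assumption that the series themselves should stay unnamed, is to pass the column labels through the `keys` argument of `pd.concat`.

```python
import pandas as pd

mumbai = pd.Series(dict(apple=30, mango=45, banana=67))
kolkata = pd.Series(dict(apple=32, mango=90, banana=34))
delhi = pd.Series(dict(apple=20, mango=94, banana=45))

# With axis=1, the entries in `keys` become the column labels
fruits_df = pd.concat([mumbai, kolkata, delhi], axis=1,
                      keys=['Mumbai', 'Kolkata', 'Delhi'])
print(fruits_df)
```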


#### From list of series : when it (outer) is a list, every element in the list is a row

In [6]:
pd.DataFrame([mumbai,kolkata]) # Each series becomes a row

Unnamed: 0,apple,banana,mango
Mumbai,30,67,45
Kolkata,32,34,90


#### From list of lists : when it (outer) is a list, every element in the list is a row

In [7]:
# The 2D list looks like the DataFrame once the punctuation (commas and brackets) is removed

temperatures = [
    [20,32,32],# Row1
    [35,40,40],# Row2
    [10,10,30],# Row3
]

pd.DataFrame(temperatures)

Unnamed: 0,0,1,2
0,20,32,32
1,35,40,40
2,10,10,30


In [8]:
# From random array

pd.DataFrame(
    np.random.choice(range(1,32),size=(7,4),replace=False),
    index=['S','M','T','W','T','F','S'],
    columns=('week '+str(i) for i in range(4))
)

Unnamed: 0,week 0,week 1,week 2,week 3
S,26,14,2,1
M,20,24,28,9
T,11,30,29,19
W,6,17,23,25
T,16,5,8,31
F,13,22,12,27
S,3,7,10,18


#### From list of dictionaries : when it (outer) is a list, every element in the list is a row

In [9]:
# This is a little strange
# The list looks like the DataFrame once the punctuation (commas, brackets, braces) is removed
# Column names are derived from the keys in the dictionaries

temperatures = [# Columns are named
    {'Bengaluru':20,'Delhi':32,'Mumbai':32}, # Row1
    {'Bengaluru':35,'Delhi':40,'Mumbai':40}, # Row2
    {'Bengaluru':10,'Delhi':10,'Mumbai':30}, # Row3
]

pd.DataFrame(temperatures)

Unnamed: 0,Bengaluru,Delhi,Mumbai
0,20,32,32
1,35,40,40
2,10,10,30


#### From dictionary of lists : when it (outer) is a dictionary, every key is a column

In [10]:
# Column names from the keys in the dictionaries

temperatures = {
    "Bengaluru":[20,35,10],# Column1
    "Delhi":[32,40,10], # Column2
    "Mumbai":[32,40,30], # Column3
}

pd.DataFrame(temperatures)

Unnamed: 0,Bengaluru,Delhi,Mumbai
0,20,32,32
1,35,40,40
2,10,10,30


#### From dictionary of dictionaries : when it (outer) is a dictionary, every key is a column

In [11]:
# Column names come from the keys of the outer dictionary and the index from the keys of the inner dictionaries
# Logically, the values in each inner dictionary share a data type, so it is analogous to an Excel/DB column

temperatures = {
    "Bengaluru" : {'rain':20,'summer':35,'winter':10}, # Column1
    "Delhi" : {'rain':32,'summer':40,'winter':10}, # Column2
    "Mumbai" : {'rain':32,'summer':40,'winter':30}, # Column3
}

pd.DataFrame(temperatures)

Unnamed: 0,Bengaluru,Delhi,Mumbai
rain,20,32,32
summer,35,40,40
winter,10,10,30
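
If the outer keys should become rows instead of columns, one possible approach is `pd.DataFrame.from_dict` with `orient='index'`. The sketch below uses a shortened copy of the data and the illustrative names `by_column` and `by_row`.

```python
import pandas as pd

temperatures = {
    "Bengaluru": {'rain': 20, 'summer': 35, 'winter': 10},
    "Delhi": {'rain': 32, 'summer': 40, 'winter': 10},
}

# Default constructor: outer keys -> columns
by_column = pd.DataFrame(temperatures)

# orient='index': outer keys -> rows
by_row = pd.DataFrame.from_dict(temperatures, orient='index')

print(by_column)
print(by_row)
```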


### Index and Column arrangement

#### Reorder the columns/index during frame creation

In [12]:
temperatures = {
    "Bengaluru" : {'rain':20,'summer':35,'winter':10}, # Column1
    "Delhi" : {'rain':32,'summer':40,'winter':10}, # Column2
    "Mumbai" : {'rain':32,'summer':40,'winter':30}, # Column3
}

# Columns are already named in the data, so we only re-order them; a name that is not in the data becomes a NaN column
temperatures_frame = pd.DataFrame(temperatures
             ,index=['winter','rain','summer']
             ,columns=['Mumbai','Delhi','Bengaluru']
)

temperatures_frame

Unnamed: 0,Mumbai,Delhi,Bengaluru
winter,30,10,10
rain,32,32,20
summer,40,40,35
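
The NaN-column remark in the comment above can be checked with a small sketch. `'Chennai'` is a hypothetical label chosen only because it is not a key in the data.

```python
import pandas as pd

temperatures = {
    "Delhi": {'rain': 32, 'summer': 40, 'winter': 10},
    "Mumbai": {'rain': 32, 'summer': 40, 'winter': 30},
}

# 'Chennai' is not present in the data, so its column is filled with NaN;
# 'Delhi' is present in the data but not requested, so it is dropped
frame = pd.DataFrame(temperatures,
                     index=['winter', 'rain', 'summer'],
                     columns=['Mumbai', 'Chennai'])
print(frame)
```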


#### Name the columns/index during frame creation

In [13]:
temperatures = [
    [27,33,35,42,41,43,34,35,37,41,26,23],
    [9,10,35,42,41,43,34,35,37,38,12,8],
    [9,10,30,34,36,35,30,28,22,27,12,8]
]


# Note : While creating this DataFrame we are only naming the index and columns.
# For list-of-dict, dict-of-list and dict-of-dict inputs the columns/index may already be named;
# in those cases the same arguments only re-order them during creation.
temperatures_frame = pd.DataFrame(temperatures
             ,columns=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
             ,index=['Mumbai','Delhi','Bengaluru']
            
)

temperatures_frame




Unnamed: 0,Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec
Mumbai,27,33,35,42,41,43,34,35,37,41,26,23
Delhi,9,10,35,42,41,43,34,35,37,38,12,8
Bengaluru,9,10,30,34,36,35,30,28,22,27,12,8


#### Reindexing : only re-order the column/index

In [14]:
# Re-indexing : select/drop index and column labels and/or change their order

# This returns a new DataFrame
temperatures_frame.reindex(columns=['Dec','May','Jul','Oct'],index=['Bengaluru','Mumbai'])



Unnamed: 0,Dec,May,Jul,Oct
Bengaluru,8,36,30,27
Mumbai,23,41,34,41


In [15]:
# This is the same as above, since reindex returns a new DataFrame anyway

pd.DataFrame(temperatures_frame,columns=['Dec','May','Jul','Oct'],index=['Bengaluru','Mumbai'])

Unnamed: 0,Dec,May,Jul,Oct
Bengaluru,8,36,30,27
Mumbai,23,41,34,41
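
`reindex` also accepts labels that do not exist in the frame. In the sketch below, `'Pune'` is a hypothetical label absent from the small example frame; it produces a NaN row unless `fill_value` is given.

```python
import pandas as pd

frame = pd.DataFrame(
    [[27, 33], [9, 10]],
    index=['Mumbai', 'Delhi'],
    columns=['Jan', 'Feb'],
)

# 'Pune' is absent from the original frame
with_nan = frame.reindex(index=['Delhi', 'Pune'])
with_default = frame.reindex(index=['Delhi', 'Pune'], fill_value=0)

print(with_nan)
print(with_default)
```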


#### Renaming column/index : provide maps (old=>new) for renaming

In [16]:
temperatures_frame.rename(index={'Mumbai':'Mum'},columns={'Jan':'January','Oct':'October'},inplace=True)
temperatures_frame

Unnamed: 0,January,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,October,Nov,Dec
Mum,27,33,35,42,41,43,34,35,37,41,26,23
Delhi,9,10,35,42,41,43,34,35,37,38,12,8
Bengaluru,9,10,30,34,36,35,30,28,22,27,12,8
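
Besides old=>new maps, `rename` also accepts a function that is applied to every label. A minimal sketch on a small example frame; the `'_temp'` suffix is only illustrative.

```python
import pandas as pd

frame = pd.DataFrame(
    [[27, 33], [9, 10]],
    index=['Mumbai', 'Delhi'],
    columns=['Jan', 'Feb'],
)

# A callable is applied to each index / column label;
# without inplace=True a new DataFrame is returned
renamed = frame.rename(index=str.upper, columns=lambda c: c + '_temp')
print(renamed)
```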


#### Existing column as index 

In [17]:

cities = {
    'name':['Delhi','Pune','Mumbai'],
    'population':[1.9,0.31,1.84],
    'area':[1484,331,603],
    'state':['Central','Maharashtra','Maharashtra']
}


In [18]:
pd.DataFrame(cities)

Unnamed: 0,area,name,population,state
0,1484,Delhi,1.9,Central
1,331,Pune,0.31,Maharashtra
2,603,Mumbai,1.84,Maharashtra


In [19]:
# A nice idea: use an existing column as the index

pd.DataFrame(cities,
             index=cities['name'], # This gives all the city names
             columns=[c for c in cities if c != 'name'] # Exclude 'name' as it is already the index; a list (not a set) keeps the column order stable
            )

Unnamed: 0,population,area,state
Delhi,1.9,1484,Central
Pune,0.31,331,Maharashtra
Mumbai,1.84,603,Maharashtra


In [20]:
# Another inplace and simpler way to set existing column as index

cities_frame = pd.DataFrame(cities)
cities_frame.set_index('name',inplace=True)
cities_frame

Unnamed: 0_level_0,area,population,state
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Delhi,1484,1.9,Central
Pune,331,0.31,Maharashtra
Mumbai,603,1.84,Maharashtra
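
The reverse operation is `reset_index`, which moves the index back into an ordinary column. A small sketch with a reduced copy of the data; `restored` is an illustrative name.

```python
import pandas as pd

cities_frame = pd.DataFrame({
    'name': ['Delhi', 'Pune', 'Mumbai'],
    'area': [1484, 331, 603],
}).set_index('name')

# The 'name' index becomes a regular column again and a default 0..n-1 index is created
restored = cities_frame.reset_index()
print(restored)
```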


#### Insert new columns 

In [21]:
# alter table add column region 
cities_with_regions = cities_frame.copy(deep=True)
cities_with_regions.insert(1,'region',None)
cities_with_regions

Unnamed: 0_level_0,area,region,population,state
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Delhi,1484,,1.9,Central
Pune,331,,0.31,Maharashtra
Mumbai,603,,1.84,Maharashtra


In [22]:
# alter table add column it default True
cities_with_it = cities_frame.copy(deep=True)
cities_with_it.insert(2,'it',True)
cities_with_it

Unnamed: 0_level_0,area,population,it,state
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Delhi,1484,1.9,True,Central
Pune,331,0.31,True,Maharashtra
Mumbai,603,1.84,True,Maharashtra


In [23]:
# New column whose values are derived from an existing column
# alter table add column "readable population" default ....
cities_with_readable_population = cities_frame.copy(deep=True)
cities_with_readable_population.insert(
    loc=3,
    column='readable population',
    value = cities_with_readable_population.population.apply(lambda p : str(p)+' lacs')
)
cities_with_readable_population

Unnamed: 0_level_0,area,population,state,readable population
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Delhi,1484,1.9,Central,1.9 lacs
Pune,331,0.31,Maharashtra,0.31 lacs
Mumbai,603,1.84,Maharashtra,1.84 lacs
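
When the position of the new column does not matter, `assign` is a possible alternative to `copy` plus `insert`: it returns a new frame and appends the column at the end. The column name `density` is an assumption made for this sketch.

```python
import pandas as pd

cities_frame = pd.DataFrame(
    {'population': [1.9, 0.31, 1.84], 'area': [1484, 331, 603]},
    index=['Delhi', 'Pune', 'Mumbai'],
)

# assign leaves cities_frame untouched and returns a copy with the extra column
with_density = cities_frame.assign(
    density=lambda df: df.population * 100000 / df.area
)
print(with_density)
```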


## DQL

In [24]:
cities = {
    'population':[1.9,0.31,1.84],
    'area':[1484,331,603],
    'state':['Delhi','Maharashtra','Maharashtra']
}

pd.DataFrame(cities,index=['Delhi','Pune','Mumbai'])



Unnamed: 0,area,population,state
Delhi,1484,1.9,Delhi
Pune,331,0.31,Maharashtra
Mumbai,603,1.84,Maharashtra


In [25]:
# Each column is a series

print(type(cities_frame.area))
cities_frame.area


<class 'pandas.core.series.Series'>


name
Delhi     1484
Pune       331
Mumbai     603
Name: area, dtype: int64

In [26]:
# Each row is also a Series (with the column names as its index)

print(type(cities_frame.loc['Delhi']))
print(cities_frame.loc['Delhi'].index)
cities_frame.loc['Delhi']

<class 'pandas.core.series.Series'>
Index(['area', 'population', 'state'], dtype='object')


area             1484
population        1.9
state         Central
Name: Delhi, dtype: object

### Note :  
It is important to understand that operations on DataFrames return other DataFrames or Series, so further operations can be applied to the results (chaining).

#### Projection

In [27]:
cities_frame['area']

name
Delhi     1484
Pune       331
Mumbai     603
Name: area, dtype: int64

In [28]:
cities_frame.area # same as above

name
Delhi     1484
Pune       331
Mumbai     603
Name: area, dtype: int64

In [29]:
cities_frame[['area','population']] # select area,population from cities

Unnamed: 0_level_0,area,population
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Delhi,1484,1.9
Pune,331,0.31
Mumbai,603,1.84


In [30]:
# Apply a function to each value : here apply is called on the Series
cities_frame.population.apply(np.log10) # select log10(population) from cities

name
Delhi     0.278754
Pune     -0.508638
Mumbai    0.264818
Name: population, dtype: float64

In [31]:
# population density
# select population*100000/area from cities where index in ('Delhi','Pune')
(cities_frame.population*100000/cities_frame.area)[['Delhi','Pune']]

name
Delhi    128.032345
Pune      93.655589
dtype: float64

### Note : 
<ul>apply (with axis=1) : the provided function is called once for each row of the DataFrame
<li>if the function returns a scalar for each row, 'apply' outputs a Series</li>
<li>if the function returns a Series for each row, 'apply' outputs a DataFrame in which each returned Series becomes a row</li>
<li>if the function returns a list with as many elements as the DataFrame has columns, the pandas version used here outputs a DataFrame in which each list becomes a row; newer versions (0.23 and later) output a Series of lists unless result_type='expand' or result_type='broadcast' is passed</li>
</ul>

In [32]:
# The output of lambda is the modified series, so final output of apply is a dataFrame
cities_frame.apply(lambda series : series.apply(lambda value : value.upper() if isinstance(value, str) else value),axis=1)

Unnamed: 0_level_0,area,population,state
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Delhi,1484,1.9,CENTRAL
Pune,331,0.31,MAHARASHTRA
Mumbai,603,1.84,MAHARASHTRA


In [33]:
# [0,0,None] has the same cardinality (3) as the columns of each row (area, population, state)
cities_frame.apply(lambda series : [0,0,None],axis=1)

Unnamed: 0_level_0,area,population,state
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Delhi,0.0,0.0,
Pune,0.0,0.0,
Mumbai,0.0,0.0,


In [34]:
# [0,0,0,0] has a different cardinality (4) than the columns of each row (area, population, state)
cities_frame.apply(lambda series : [0,0,0,0],axis=1)

name
Delhi     [0, 0, 0, 0]
Pune      [0, 0, 0, 0]
Mumbai    [0, 0, 0, 0]
dtype: object

In [35]:
# This is same as above example of population density
# select population*100000/area from cities where index in ('Delhi','Pune')
cities_frame.apply(lambda series : series.population*100000/series.area,axis=1)[['Delhi','Pune']]

name
Delhi    128.032345
Pune      93.655589
dtype: float64

In [36]:
# select area||' sq.kms',population||' lakhs' from cities

cities_frame[['area','population']] \
.apply(lambda series : [str(series.area)+' sq.kms',str(series.population)+' lakhs'],axis=1)

Unnamed: 0_level_0,area,population
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Delhi,1484.0 sq.kms,1.9 lakhs
Pune,331.0 sq.kms,0.31 lakhs
Mumbai,603.0 sq.kms,1.84 lakhs


In [37]:
cities_frame[['area','population']]  \
.apply(lambda series : series.area,axis=1)

name
Delhi     1484.0
Pune       331.0
Mumbai     603.0
dtype: float64
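
The list-returning examples above depend on the pandas version. As a hedged sketch for recent versions (0.23 and later), the default result for a list-returning function is a Series of lists, and `result_type='expand'` asks for a DataFrame instead. The helper name `label` is only illustrative.

```python
import pandas as pd

cities_frame = pd.DataFrame(
    {'area': [1484, 331], 'population': [1.9, 0.31]},
    index=['Delhi', 'Pune'],
)

def label(row):
    # Returns a list with one entry per column
    return [str(row.area) + ' sq.kms', str(row.population) + ' lakhs']

# Default: one list per row, collected into a Series
as_series = cities_frame.apply(label, axis=1)

# result_type='expand': each list is spread over columns 0, 1, ...
expanded = cities_frame.apply(label, axis=1, result_type='expand')

print(as_series)
print(expanded)
```

The expanded frame gets integer column labels; they can be renamed afterwards if the original names are wanted.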

#### Selection (where clause) 

In [38]:
cities_frame.loc['Delhi']  # Where index='Delhi'

area             1484
population        1.9
state         Central
Name: Delhi, dtype: object

In [39]:
cities_frame.loc[['Delhi','Pune']] # Where index in ('Delhi','Pune')

Unnamed: 0_level_0,area,population,state
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Delhi,1484,1.9,Central
Pune,331,0.31,Maharashtra


In [40]:
cities_frame.loc[cities_frame.area>600] # Note : cities_frame.area>600 is a bool series

Unnamed: 0_level_0,area,population,state
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Delhi,1484,1.9,Central
Mumbai,603,1.84,Maharashtra


#### Projection + Selection

In [41]:
cities_frame[['state','population']].loc[cities_frame.area>600] # select state,population from cities where area>600

Unnamed: 0_level_0,state,population
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Delhi,Central,1.9
Mumbai,Maharashtra,1.84


In [42]:
cities_frame.loc[cities_frame.area>600][['state','population']] # select state,population from cities where area>600

Unnamed: 0_level_0,state,population
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Delhi,Central,1.9
Mumbai,Maharashtra,1.84


In [43]:
# select area from cities where name = 'Delhi'
cities_frame.loc['Delhi','area'] # Note first param is index and second param is column

1484

In [44]:
# select area from cities where name = 'Delhi'
cities_frame.at['Delhi','area']

1484

### Note : 
Use 'at' if you only need to get or set a single value in a DataFrame or Series.
'loc', on the other hand, can access a single value but also a group of rows and columns by a label or labels.
For single-value access, 'at' is generally the faster of the two.
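
A minimal sketch of the two accessors on a reduced copy of the data; the new value 332 is arbitrary and only shows that 'at' can also set a single cell.

```python
import pandas as pd

cities_frame = pd.DataFrame(
    {'area': [1484, 331, 603]},
    index=['Delhi', 'Pune', 'Mumbai'],
)

# Both return the same scalar for a single (row, column) pair
via_loc = cities_frame.loc['Delhi', 'area']
via_at = cities_frame.at['Delhi', 'area']

# 'at' can also set a single cell
cities_frame.at['Pune', 'area'] = 332

print(via_loc, via_at)
print(cities_frame)
```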

In [45]:
# select area where name in ('Delhi','Mumbai')
cities_frame.loc[['Delhi','Mumbai'],'area']

name
Delhi     1484
Mumbai     603
Name: area, dtype: int64

In [46]:
# select state, population from cities where area > 600
cities_frame.loc[cities_frame.area>600,['state','population']]

Unnamed: 0_level_0,state,population
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Delhi,Central,1.9
Mumbai,Maharashtra,1.84


In [47]:
# Nested query
# select * from (select * from cities where area>600) where state='Maharashtra'
cities_frame.loc[cities_frame.area>600].loc[cities_frame.state=='Maharashtra'] 

Unnamed: 0_level_0,area,population,state
name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Mumbai,603,1.84,Maharashtra
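
The chained filters above can also be written as one condition. The sketch below shows two possible forms on a reduced copy of the data: a combined boolean mask and `DataFrame.query`.

```python
import pandas as pd

cities_frame = pd.DataFrame(
    {'area': [1484, 331, 603],
     'state': ['Central', 'Maharashtra', 'Maharashtra']},
    index=['Delhi', 'Pune', 'Mumbai'],
)

# select * from cities where area > 600 and state = 'Maharashtra'
# Parentheses are needed because & binds tighter than > and ==
mask = (cities_frame.area > 600) & (cities_frame.state == 'Maharashtra')
by_mask = cities_frame.loc[mask]

# The same filter written as a query string
by_query = cities_frame.query("area > 600 and state == 'Maharashtra'")

print(by_mask)
```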


#### Order by

In [48]:
# select state,population from cities order by state desc, population asc

cities_frame.sort_values(by=['state','population'],ascending=[False,True])[['state','population']]

Unnamed: 0_level_0,state,population
name,Unnamed: 1_level_1,Unnamed: 2_level_1
Pune,Maharashtra,0.31
Mumbai,Maharashtra,1.84
Delhi,Central,1.9


#### Aggregation

In [49]:
cities_frame.area.sum() # select sum(area) from cities

2418

In [50]:
np.mean(cities_frame.population) # select avg(population) from cities

1.3499999999999999

In [51]:
cities_frame.area.max(),cities_frame.population.min() # select max(area),min(population) from cities

(1484, 0.31)
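
Several aggregates can be requested in one call with `agg`. This is a sketch on a reduced copy of the data; cells for functions that were not requested for a column come back as NaN.

```python
import pandas as pd

cities_frame = pd.DataFrame(
    {'population': [1.9, 0.31, 1.84], 'area': [1484, 331, 603]},
    index=['Delhi', 'Pune', 'Mumbai'],
)

# select sum(area), max(area), min(population), avg(population) from cities
summary = cities_frame.agg({
    'area': ['sum', 'max'],
    'population': ['min', 'mean'],
})
print(summary)
```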

## DML

#### Updates

In [52]:
cities_frame = pd.DataFrame(
    cities,
    index=['Delhi','Pune','Mumbai'],
    columns=['area','population','state','type','tier','old_name']
)
cities_frame

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484,1.9,Delhi,,,
Pune,331,0.31,Maharashtra,,,
Mumbai,603,1.84,Maharashtra,,,


In [53]:
# update cities set type = 'Unassigned'
# Update all values in column to a single value
cities_frame.type = 'Unassigned'
cities_frame

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484,1.9,Delhi,Unassigned,,
Pune,331,0.31,Maharashtra,Unassigned,,
Mumbai,603,1.84,Maharashtra,Unassigned,,


In [54]:
# Update values with a list
cities_frame.type = ['UT','Normal','State Capital']
cities_frame

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484,1.9,Delhi,UT,,
Pune,331,0.31,Maharashtra,Normal,,
Mumbai,603,1.84,Maharashtra,State Capital,,


In [55]:
# Update values with a series
cities_frame.old_name = cities_frame.index
cities_frame.type = cities_frame.area > 500
cities_frame

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484,1.9,Delhi,True,,Delhi
Pune,331,0.31,Maharashtra,False,,Pune
Mumbai,603,1.84,Maharashtra,True,,Mumbai


In [56]:
# Update values with a series
cities_frame.type = cities_frame.area.apply(lambda v : 'Big' if v > 500 else 'Small')
cities_frame

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484,1.9,Delhi,Big,,Delhi
Pune,331,0.31,Maharashtra,Small,,Pune
Mumbai,603,1.84,Maharashtra,Big,,Mumbai


In [57]:
# Update values with a series (which has subset of keys)
cities_frame.type = pd.Series(['UT','Capital'],index=['Delhi','Mumbai'])
cities_frame

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484,1.9,Delhi,UT,,Delhi
Pune,331,0.31,Maharashtra,,,Pune
Mumbai,603,1.84,Maharashtra,Capital,,Mumbai
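
The NaN for Pune appears because the assigned Series is aligned on the index, not on position. A small sketch of that alignment, with `fillna` as one possible way to supply a default afterwards; the value `'Unassigned'` is only an example.

```python
import pandas as pd

cities_frame = pd.DataFrame(
    {'area': [1484, 331, 603]},
    index=['Delhi', 'Pune', 'Mumbai'],
)

# The Series is listed in a different order and has no entry for Pune
partial = pd.Series(['Capital', 'UT'], index=['Mumbai', 'Delhi'])

# Values are matched by label; the missing label becomes NaN
cities_frame['type'] = partial
missing_before_fill = int(cities_frame['type'].isna().sum())

# Fill the gap with a default value
cities_frame['type'] = cities_frame['type'].fillna('Unassigned')
print(cities_frame)
```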


In [58]:
# update cities set type = 'Big' where area > 600
cities_frame.loc[cities_frame.area>600,'type']='Big'
cities_frame

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484,1.9,Delhi,Big,,Delhi
Pune,331,0.31,Maharashtra,,,Pune
Mumbai,603,1.84,Maharashtra,Big,,Mumbai


In [59]:
# update cities set type = 'Big',tier = 1 where area > 600
cities_frame.loc[cities_frame.area>600,['type','tier']]=('Big',1)
cities_frame

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484,1.9,Delhi,Big,1.0,Delhi
Pune,331,0.31,Maharashtra,,,Pune
Mumbai,603,1.84,Maharashtra,Big,1.0,Mumbai


In [60]:
# update cities set type = 'Big',tier = 1 where index in ('Delhi','Mumbai')
cities_frame.loc[['Delhi','Mumbai'],['type','tier']]=('Big',1)
cities_frame

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484,1.9,Delhi,Big,1.0,Delhi
Pune,331,0.31,Maharashtra,,,Pune
Mumbai,603,1.84,Maharashtra,Big,1.0,Mumbai


In [61]:
# Set all values in all rows having area>1000 to None

cities_frame.loc[cities_frame.area>1000] = None
cities_frame


Unnamed: 0,area,population,state,type,tier,old_name
Delhi,,,,,,
Pune,331.0,0.31,Maharashtra,,,Pune
Mumbai,603.0,1.84,Maharashtra,Big,1.0,Mumbai


In [62]:
cities_frame.loc['Delhi'] = (1484,1.90,'Delhi','Big',1,'Delhi')
cities_frame

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484.0,1.9,Delhi,Big,1.0,Delhi
Pune,331.0,0.31,Maharashtra,,,Pune
Mumbai,603.0,1.84,Maharashtra,Big,1.0,Mumbai


#### Update using Replace

In [63]:
series = pd.Series([2,3,5,1,4,5])

In [64]:
series.replace(5,0)

0    2
1    3
2    0
3    1
4    4
5    0
dtype: int64

In [65]:
series.replace(5,0,inplace=True)
series

0    2
1    3
2    0
3    1
4    4
5    0
dtype: int64

In [66]:
# update series set n = -1 where n in (1,2,3)
series.replace([1,2,3],-1)

0   -1
1   -1
2    0
3   -1
4    4
5    0
dtype: int64

In [67]:
series.replace([1,2,3],[-1,-2,-3])

0   -2
1   -3
2    0
3   -1
4    4
5    0
dtype: int64

In [68]:
# Replace matching values in all columns (index labels are not changed)
cities_frame.replace('Delhi','Dilli')

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484.0,1.9,Dilli,Big,1.0,Dilli
Pune,331.0,0.31,Maharashtra,,,Pune
Mumbai,603.0,1.84,Maharashtra,Big,1.0,Mumbai


In [69]:
# replace multiple values with multiple values
cities_frame.replace(['Delhi','Mumbai'],['Dilli','Bombay'])

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484.0,1.9,Dilli,Big,1.0,Dilli
Pune,331.0,0.31,Maharashtra,,,Pune
Mumbai,603.0,1.84,Maharashtra,Big,1.0,Bombay


In [70]:
# replace with regex
cities_frame.replace(r'(.+)i$',r'\1y',regex=True)

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484.0,1.9,Delhy,Big,1.0,Delhy
Pune,331.0,0.31,Maharashtra,,,Pune
Mumbai,603.0,1.84,Maharashtra,Big,1.0,Mumbay


In [71]:
# replace with dictionary
cities_frame.replace(to_replace={'Delhi':'Dilli','Mumbai':'Bombay'})

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484.0,1.9,Dilli,Big,1.0,Dilli
Pune,331.0,0.31,Maharashtra,,,Pune
Mumbai,603.0,1.84,Maharashtra,Big,1.0,Bombay


In [72]:
# replace with dictionary, mentioning the specific columns
cities_frame.replace(to_replace={'state':{'Delhi':'Dilli'},'old_name':{'Mumbai':'Bombay'}})

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484.0,1.9,Dilli,Big,1.0,Delhi
Pune,331.0,0.31,Maharashtra,,,Pune
Mumbai,603.0,1.84,Maharashtra,Big,1.0,Bombay


In [73]:
# replace inplace : mutate original values
mutable_cities_frame = cities_frame.copy(deep=True)
mutable_cities_frame.replace(to_replace={'state':{'Delhi':'Dilli'},'old_name':{'Mumbai':'Bombay'}},inplace=True)
mutable_cities_frame

Unnamed: 0,area,population,state,type,tier,old_name
Delhi,1484.0,1.9,Dilli,Big,1.0,Delhi
Pune,331.0,0.31,Maharashtra,,,Pune
Mumbai,603.0,1.84,Maharashtra,Big,1.0,Bombay


In [74]:
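# replace with method='ffill' : not supported on a DataFrame in this pandas version, so this call raises a TypeError
cities_frame.replace(to_replace='Maharashtra', 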
           value=None, 
           method='ffill')

TypeError: cannot replace ['Maharashtra'] with method ffill on a DataFrame
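
Assuming the intent of the failing call was to overwrite 'Maharashtra' with the previous value in the column, one possible workaround is to blank out the matches with `mask` and then forward-fill. This is a sketch on a single column, not a drop-in equivalent of `replace`.

```python
import pandas as pd

state = pd.Series(
    ['Delhi', 'Maharashtra', 'Maharashtra'],
    index=['Delhi', 'Pune', 'Mumbai'],
)

# mask() turns the matching entries into NaN, ffill() copies the last valid value forward
filled = state.mask(state == 'Maharashtra').ffill()
print(filled)
```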