# Python Pandas - Reindexing

Reindexing changes the row labels and column labels of a DataFrame. To reindex means to conform the data to match a given set of labels along a particular axis.

Multiple operations can be accomplished through reindexing, such as:
<ul>
<li>Reorder the existing data to match a new set of labels.</li>
<li>Insert missing value (NA) markers in label locations where no data for the label existed.</li>
</ul>

In [2]:
import pandas as pd 
import numpy as np

In [3]:
N=20

df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
})
df

           A    x         y       C           D
0 2016-01-01  0.0  0.878980     Low   93.603840
1 2016-01-02  1.0  0.256384  Medium   90.585200
2 2016-01-03  2.0  0.171137    High  102.420790
3 2016-01-04  3.0  0.698820    High   96.127178
4 2016-01-05  4.0  0.970895  Medium  113.495153
5 2016-01-06  5.0  0.103111  Medium   99.700516
6 2016-01-07  6.0  0.343106     Low   94.709097
7 2016-01-08  7.0  0.629470  Medium   88.044143
8 2016-01-09  8.0  0.159187    High  103.089473
9 2016-01-10  9.0  0.216422    High  116.182161


In [4]:
#reindex the DataFrame
df_reindexed = df.reindex(index=[0,2,5], columns=['A', 'C', 'B'])

print (df_reindexed)

           A       C   B
0 2016-01-01     Low NaN
2 2016-01-03    High NaN
5 2016-01-06  Medium NaN


### Reindex to Align with Other Objects

You may wish to reindex an object so that its axes carry the same labels as another object. Consider the following example.

In [5]:
df1 = pd.DataFrame(np.random.randn(10,3),columns=['col1','col2','col3'])
df1

       col1      col2      col3
0 -0.224212  0.112174 -1.197711
1  1.554986 -1.215673 -1.088758
2 -0.326401  1.212110  0.729213
3 -0.094391 -2.604147  1.505796
4  0.047170 -0.205964  1.214196
5  0.078137  0.643116 -1.293871
6 -0.183651  1.887090  0.326264
7 -0.752645 -0.820832  0.368542
8  1.397450 -2.127300 -0.047126
9 -0.042023  0.305663  0.411343


In [6]:
df2 = pd.DataFrame(np.random.randn(7,3),columns=['col1','col2','col3'])
df2

       col1      col2      col3
0 -0.178566  0.582605  0.511536
1 -0.250896 -0.332490 -1.199264
2  0.123365  0.780566  0.828859
3  0.551816  0.054025  0.408862
4 -0.815953 -1.875632 -1.288187
5  1.377147  0.933458  0.708744
6  0.028928  0.227809 -1.472817


In [7]:
df1 = df1.reindex_like(df2)
df1

       col1      col2      col3
0 -0.224212  0.112174 -1.197711
1  1.554986 -1.215673 -1.088758
2 -0.326401  1.212110  0.729213
3 -0.094391 -2.604147  1.505796
4  0.047170 -0.205964  1.214196
5  0.078137  0.643116 -1.293871
6 -0.183651  1.887090  0.326264


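#### Note
Here, df1 is reindexed like df2 and the result is assigned back to df1; reindex_like() itself returns a new object and does not modify the caller. The column names of the two objects should match; a column label that exists only in the target object is filled entirely with NaN.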
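
The two points in the note can be checked with a minimal sketch. The frames `left` and `right` and the column name `other` are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frames for illustration only.
left = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                    columns=['col1', 'col2', 'col3'])
right = pd.DataFrame(np.zeros((2, 3)),
                     columns=['col1', 'col2', 'other'])

# reindex_like() returns a new object labeled like `right`.
aligned = left.reindex_like(right)

print(left.shape)     # (4, 3): the caller is untouched
print(aligned.shape)  # (2, 3)

# 'other' has no counterpart in `left`, so the whole column is NaN.
print(aligned)
```

Only the labels of `right` are used; its values play no role in the result.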

### Filling while Reindexing
reindex() takes an optional parameter, method, which selects a filling method from the following values:
<ul>
    <li>pad/ffill − Fill values forward</li>
    <li>bfill/backfill − Fill values backward</li>
    <li>nearest − Fill from the nearest index values</li>
</ul> 
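
As a rough illustration of the three filling methods, the sketch below reindexes a small Series whose labels have gaps. The Series `s` and the label list `new_index` are made up for this example; the original index must be monotonic for `method` to work.

```python
import pandas as pd

# Hypothetical Series with gaps between its labels.
s = pd.Series([10.0, 20.0, 30.0], index=[0, 3, 6])
new_index = list(range(8))

forward = s.reindex(new_index, method='ffill')     # carry the last value forward
backward = s.reindex(new_index, method='bfill')    # pull the next value backward
nearest = s.reindex(new_index, method='nearest')   # take the closest label's value

print(forward.tolist())   # [10.0, 10.0, 10.0, 20.0, 20.0, 20.0, 30.0, 30.0]
print(backward.tolist())  # last entry is NaN: nothing follows label 6
print(nearest.tolist())   # [10.0, 10.0, 20.0, 20.0, 20.0, 30.0, 30.0, 30.0]
```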

In [11]:
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

# Labels missing from df2 are filled with NaN
print (df2.reindex_like(df1))

       col1      col2      col3
0  0.697793  0.298436  0.646197
1  0.150540  1.345326  0.190189
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN


In [9]:
# Now fill the NaN values with the preceding values
print ("Data Frame with Forward Fill:")
print (df2.reindex_like(df1,method='ffill'))

Data Frame with Forward Fill:
       col1      col2      col3
0 -0.900908  0.539499  0.050161
1  0.330310  0.327496  0.806398
2  0.330310  0.327496  0.806398
3  0.330310  0.327496  0.806398
4  0.330310  0.327496  0.806398
5  0.330310  0.327496  0.806398


### Limits on Filling while Reindexing
The limit argument provides additional control over filling while reindexing. It specifies the maximum number of consecutive missing labels to fill; any gaps beyond that are left as NaN. Let us consider the following example:

In [13]:
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
df2 = pd.DataFrame(np.random.randn(2,3),columns=['col1','col2','col3'])

# Labels missing from df2 are filled with NaN
print (df2.reindex_like(df1))


       col1      col2      col3
0 -1.405854  0.537727 -0.321907
1 -0.100205  0.226476 -0.731198
2       NaN       NaN       NaN
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN


In [14]:
# Now fill the NaN values with the preceding values, at most one row per gap
print ("Data Frame with Forward Fill limiting to 1:")
print (df2.reindex_like(df1,method='ffill',limit=1))

Data Frame with Forward Fill limiting to 1:
       col1      col2      col3
0 -1.405854  0.537727 -0.321907
1 -0.100205  0.226476 -0.731198
2 -0.100205  0.226476 -0.731198
3       NaN       NaN       NaN
4       NaN       NaN       NaN
5       NaN       NaN       NaN


### Renaming
The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.

In [15]:
df1 = pd.DataFrame(np.random.randn(6,3),columns=['col1','col2','col3'])
print (df1)

       col1      col2      col3
0  1.556941  1.741839  1.552444
1 -1.238797  1.094284  0.732446
2  1.312957 -0.796135 -0.737363
3 -0.368517  0.076317  0.340407
4 -0.055006 -1.469500 -0.232826
5 -0.924197 -0.555527 -1.789623


In [16]:
print ("After renaming the rows and columns:")
print (df1.rename(columns={'col1' : 'c1', 'col2' : 'c2'},
                  index={0 : 'apple', 1 : 'banana', 2 : 'durian'}))

After renaming the rows and columns:
              c1        c2      col3
apple   1.556941  1.741839  1.552444
banana -1.238797  1.094284  0.732446
durian  1.312957 -0.796135 -0.737363
3      -0.368517  0.076317  0.340407
4      -0.055006 -1.469500 -0.232826
5      -0.924197 -0.555527 -1.789623


The rename() method provides an inplace named parameter, which by default is False and copies the underlying data. Pass inplace=True to rename the data in place.
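
The following sketch illustrates both remaining points: relabeling with a function instead of a dict, and the effect of inplace=True. The frame `df_demo` is a hypothetical stand-in.

```python
import pandas as pd

# Hypothetical frame for illustration only.
df_demo = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# A function is applied to every label on the chosen axis.
upper = df_demo.rename(columns=str.upper)
print(list(upper.columns))    # ['COL1', 'COL2']
print(list(df_demo.columns))  # ['col1', 'col2'] (original unchanged)

# With inplace=True the call returns None and relabels df_demo itself.
result = df_demo.rename(index={0: 'first'}, inplace=True)
print(result)                 # None
print(list(df_demo.index))    # ['first', 1]
```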

# Python Pandas - Iteration

The behavior of basic iteration over Pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. Other data structures, like DataFrame (and Panel, which existed only in pandas versions before 0.25), follow the dict-like convention of iterating over the keys of the objects.

In short, basic iteration (for i in object) produces:
<ul>
    <li>Series − values</li>
    <li>DataFrame − column labels</li>
    <li>Panel − item labels (older pandas versions only)</li>
</ul_placeholder>
</ul>
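
A short sketch of the difference, using a made-up Series `ser` and DataFrame `frame`:

```python
import pandas as pd

# Hypothetical objects for illustration only.
ser = pd.Series([1.5, 2.5], index=['a', 'b'])
frame = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

series_items = [v for v in ser]    # values, not the index labels
frame_items = [c for c in frame]   # column labels, not the rows

print(series_items)  # [1.5, 2.5]
print(frame_items)   # ['col1', 'col2']
```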

### Iterating a DataFrame
Iterating a DataFrame gives column names. Let us consider the following example to understand the same.

In [18]:
N=20
df = pd.DataFrame({
   'A': pd.date_range(start='2016-01-01',periods=N,freq='D'),
   'x': np.linspace(0,stop=N-1,num=N),
   'y': np.random.rand(N),
   'C': np.random.choice(['Low','Medium','High'],N).tolist(),
   'D': np.random.normal(100, 10, size=(N)).tolist()
   })

print(df)

            A     x         y       C           D
0  2016-01-01   0.0  0.181682    High  104.891946
1  2016-01-02   1.0  0.211987     Low   99.385930
2  2016-01-03   2.0  0.306421     Low  102.221874
3  2016-01-04   3.0  0.965610  Medium   94.797476
4  2016-01-05   4.0  0.634441     Low   83.056414
5  2016-01-06   5.0  0.606227  Medium  100.852246
6  2016-01-07   6.0  0.563795     Low   89.489259
7  2016-01-08   7.0  0.077165     Low  103.320295
8  2016-01-09   8.0  0.496946  Medium  100.284093
9  2016-01-10   9.0  0.912456    High  123.987612
10 2016-01-11  10.0  0.690592  Medium  114.588922
11 2016-01-12  11.0  0.142421    High   86.344563
12 2016-01-13  12.0  0.642377     Low  112.193540
13 2016-01-14  13.0  0.022330     Low   80.314737
14 2016-01-15  14.0  0.027037     Low  109.689496
15 2016-01-16  15.0  0.280767     Low  115.179712
16 2016-01-17  16.0  0.946752     Low  101.149155
17 2016-01-18  17.0  0.808514     Low   96.656633
18 2016-01-19  18.0  0.141811  Medium  109.242297


In [21]:
# Iterating a DataFrame gives column names
for col in df:
    print (col)

A
x
y
C
D


#### To iterate over the columns or rows of the DataFrame, we can use the following functions:
<ul>
    <li>items() − iterate over the columns as (label, Series) pairs; this method was named iteritems() before pandas 2.0</li>
    <li>iterrows() − iterate over the rows as (index, Series) pairs</li>
    <li>itertuples() − iterate over the rows as namedtuples</li>
</ul>    

### items()
Iterates over the columns as (key, value) pairs, with the column label as the key and the column values as a Series object. Older pandas releases called this method iteritems(); it was removed in pandas 2.0, so the example below uses items().

In [24]:
df = pd.DataFrame(np.random.randn(4,3),columns=['col1','col2','col3'])
print(df)
for key,value in df.items():
    print("--------------------------")
    print (key,value)

       col1      col2      col3
0  0.056045 -0.185638 -0.529811
1  1.166736 -0.458006  0.249437
2 -0.469067  0.151883  1.405179
3 -0.210363 -0.081868  0.684018
--------------------------
col1 0    0.056045
1    1.166736
2   -0.469067
3   -0.210363
Name: col1, dtype: float64
--------------------------
col2 0   -0.185638
1   -0.458006
2    0.151883
3   -0.081868
Name: col2, dtype: float64
--------------------------
col3 0   -0.529811
1    0.249437
2    1.405179
3    0.684018
Name: col3, dtype: float64


Observe that each column is yielded separately as a (label, Series) pair.

### iterrows()
iterrows() returns an iterator yielding each index value along with a Series containing the data in that row.

In [26]:
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row_index,row in df.iterrows():
    print("------------------------")
    print(row_index,row)

------------------------
0 col1    0.378169
col2   -0.190089
col3   -1.589452
Name: 0, dtype: float64
------------------------
1 col1   -0.454305
col2   -0.790790
col3   -0.406320
Name: 1, dtype: float64
------------------------
2 col1    0.209550
col2    0.775091
col3   -1.287897
Name: 2, dtype: float64
------------------------
3 col1    0.826596
col2    1.332250
col3    1.441723
Name: 3, dtype: float64


### itertuples()
The itertuples() method returns an iterator yielding a named tuple for each row in the DataFrame. The first element of the tuple is the row’s index value, while the remaining elements are the row values.

In [29]:
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])
for row in df.itertuples():
    print (row)

Pandas(Index=0, col1=-1.3603560444873517, col2=-0.7387422414355623, col3=-1.2032482173788601)
Pandas(Index=1, col1=-0.8261486873429128, col2=0.35548358497595006, col3=-1.6167087676458745)
Pandas(Index=2, col1=-1.1908740654913663, col2=0.5346431642843509, col3=-0.23629217802002744)
Pandas(Index=3, col1=-0.2962369966370701, col2=0.1793361202804302, col3=-1.9939515236029182)
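
As a small sketch of how the named tuples can be used, the fields are reachable both by attribute name and by position, and the index and name arguments of itertuples() control the shape of each tuple. The frame and the tuple name 'Row' are illustrative assumptions.

```python
import pandas as pd

# Hypothetical frame for illustration only.
frame = pd.DataFrame({'col1': [1, 2], 'col2': [0.5, 1.5]}, index=['a', 'b'])

for row in frame.itertuples():
    # Index comes first; column values follow in column order.
    print(row.Index, row.col1, row[2])

# index=False drops the index field; name sets the namedtuple type name.
rows = list(frame.itertuples(index=False, name='Row'))
print(rows[0]._fields)  # ('col1', 'col2')
```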


### Note 
Do not try to modify any object while iterating over it. Iteration is meant for reading: depending on the data types, the iterator returns a copy of the data rather than a view, so changes made to the yielded object may not be reflected in the original object.

In [30]:
df = pd.DataFrame(np.random.randn(4,3),columns = ['col1','col2','col3'])

for index, row in df.iterrows():
    row['a'] = 10
print (df)

       col1      col2      col3
0  0.031167  1.214096  0.163644
1 -0.052015  0.968294  0.384988
2 -0.177066 -0.432456 -0.683135
3  0.485800 -0.108991  1.606057
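
When the goal really is to change the data, one common alternative is to assign through the DataFrame itself rather than through the rows yielded by an iterator. The sketch below assumes a small made-up frame; the column name 'a' mirrors the example above.

```python
import pandas as pd

# Hypothetical frame for illustration only.
frame = pd.DataFrame({'col1': [1.0, 2.0, 3.0], 'col2': ['x', 'y', 'z']})

# Add a column for every row at once instead of row by row.
frame['a'] = 10

# Update selected rows by label with a boolean mask.
frame.loc[frame['col2'] == 'y', 'col1'] = 20.0

print(frame)
```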


# Python Pandas - Sorting

There are two kinds of sorting available in Pandas:
<ul>
    <li>By label</li>
    <li>By actual value</li>
</ul>    
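
A minimal sketch contrasting the two kinds, assuming a small made-up frame: sort_index() orders by the labels, while sort_values() orders by the data in a chosen column.

```python
import pandas as pd

# Hypothetical frame whose labels and values are both out of order.
frame = pd.DataFrame({'col1': [1, 3, 2]}, index=[2, 0, 1])

by_label = frame.sort_index()             # order by the row labels
by_value = frame.sort_values(by='col1')   # order by the values in col1

print(by_label['col1'].tolist())  # [3, 2, 1]
print(by_value.index.tolist())    # [2, 1, 0]
```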

In [32]:
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
print (unsorted_df)
# In unsorted_df, the labels and the values are unsorted. Let us see how these can be sorted.

       col2      col1
1  0.953234  1.257935
4  0.221017  1.754287
6  1.672402  0.531242
2 -1.162116  1.469414
3  0.793206  0.498954
5 -0.872992  0.221028
9  1.098998 -0.164684
8  0.224294 -0.656263
0 -1.667367  0.123475
7 -0.350823  1.386921


### Order of Sorting
By passing a Boolean value to the ascending parameter of sort_index(), the order of the sorting can be controlled. Let us consider the following example.

In [3]:
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])

sorted_df = unsorted_df.sort_index(ascending=False)
print (sorted_df)

       col2      col1
9  1.044330  0.648619
8  0.836975  0.938426
7  2.260624 -0.506303
6  0.672890 -1.339164
5 -1.999130 -0.048772
4  1.103808 -0.130367
3  2.027210  0.208031
2  1.180492 -0.440598
1 -0.024700  1.887766
0  0.736454  0.469602


### Sort the Columns
By passing axis=1 to sort_index(), the sorting is done on the column labels. By default, axis=0, which sorts by the row labels. Let us consider the following example.

In [4]:
unsorted_df = pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns = ['col2','col1'])
 
sorted_df=unsorted_df.sort_index(axis=1)

print (sorted_df)

       col1      col2
1 -0.383478  1.131899
4  0.630419 -0.420324
6 -0.400175 -0.136963
2 -0.266862  1.025868
3 -0.483423 -0.636009
5  0.974012  0.712953
9 -0.429489 -1.933711
8  0.111452 -1.088416
0 -0.128751 -0.869947
7 -0.268107 -0.283661


### Sorting Algorithm
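sort_values() lets you choose the sorting algorithm through the kind parameter: quicksort (the default), mergesort, heapsort, or stable. Of the first three, mergesort is the only stable algorithm, meaning that rows with equal keys keep their original relative order. For a DataFrame, kind is only applied when sorting on a single column or label.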

In [5]:
unsorted_df = pd.DataFrame({'col1':[2,1,1,1],'col2':[1,3,2,4]})
sorted_df = unsorted_df.sort_values(by='col1' ,kind='mergesort')

print (sorted_df)

   col1  col2
1     1     3
2     1     2
3     1     4
0     2     1


# Python Pandas - Working with Text Data

In this chapter, we will discuss string operations on a basic Series/Index. In the subsequent chapters, we will learn how to apply these string functions to a DataFrame.

Pandas provides a set of string functions which make it easy to operate on string data. Most importantly, these functions ignore (or exclude) missing/NaN values.

Almost all of these methods mirror Python's built-in string methods (refer: https://docs.python.org/3/library/stdtypes.html#string-methods). They are reached through the str accessor of a Series or Index, for example s.str.lower(), which applies the operation to each element.
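
As a brief sketch of how the str accessor handles missing values, compare it with calling the Python string method directly. The Series `names` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Series containing a missing value.
names = pd.Series(['Tom', np.nan, 'SteveSmith'])

# The accessor applies str.lower() element-wise and passes NaN through.
lowered = names.str.lower()
print(lowered)

# Calling the method on the raw missing value fails instead.
try:
    np.nan.lower()
except AttributeError as exc:
    print("plain Python:", exc)
```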



Let us now create a Series and see how these functions work.

In [6]:
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

s

0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveSmith
dtype: object

### lower()

Converts strings in the Series/Index to lower case.

In [9]:
s.str.lower()

0             tom
1    william rick
2            john
3         alber@t
4             NaN
5            1234
6      stevesmith
dtype: object

### upper()

Converts strings in the Series/Index to upper case.

In [10]:
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t', np.nan, '1234','SteveSmith'])

s.str.upper()

0             TOM
1    WILLIAM RICK
2            JOHN
3         ALBER@T
4             NaN
5            1234
6      STEVESMITH
dtype: object

### len()

Computes the length of each string.

In [12]:
s.str.len()

0     3.0
1    12.0
2     4.0
3     7.0
4     NaN
5     4.0
6    10.0
dtype: float64

### strip()

Strips whitespace (including newlines) from both sides of each string in the Series/Index.

In [13]:
s.str.strip()

0             Tom
1    William Rick
2            John
3         Alber@t
4             NaN
5            1234
6      SteveSmith
dtype: object

### split(pattern)

Splits each string around the given pattern.

In [16]:
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])
print (s)
print ("Split Pattern:")
print (s.str.split(' '))

0             Tom 
1     William Rick
2             John
3          Alber@t
dtype: object
Split Pattern:
0              [Tom, ]
1    [, William, Rick]
2               [John]
3            [Alber@t]
dtype: object
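
Leading or trailing spaces produce empty strings when splitting on a single space. One way to avoid them, sketched below on a made-up Series, is to strip first or to call split() without a pattern, which splits on runs of whitespace. The expand=True argument returns the pieces as DataFrame columns.

```python
import pandas as pd

# Hypothetical Series with stray whitespace.
s = pd.Series(['Tom ', ' William Rick', 'John'])

# Strip first, then split on a single space.
parts = s.str.strip().str.split(' ')
print(parts.tolist())  # [['Tom'], ['William', 'Rick'], ['John']]

# With no pattern, split() splits on any run of whitespace.
default_parts = s.str.split()

# expand=True spreads the pieces over columns, padding short rows.
wide = s.str.strip().str.split(' ', expand=True)
print(wide.shape)      # (3, 2)
```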


### cat(sep=pattern)

Concatenates the Series/Index elements with the given separator.

In [17]:
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print (s.str.cat(sep='_'))

Tom _ William Rick_John_Alber@t


### get_dummies()

Returns a DataFrame of one-hot encoded values.

In [19]:
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print (s.str.get_dummies())

    William Rick  Alber@t  John  Tom 
0              0        0     0     1
1              1        0     0     0
2              0        0     1     0
3              0        1     0     0


### contains(pattern)

Returns True for each element that contains the given substring or pattern, and False otherwise.

In [20]:
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print (s.str.contains(' '))

0     True
1     True
2    False
3    False
dtype: bool
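
By default contains() treats the pattern as a regular expression, which matters when the pattern includes characters such as '.'. The sketch below, on a made-up Series `codes`, compares that with a literal match (regex=False) and uses na=False so that missing values count as False.

```python
import numpy as np
import pandas as pd

# Hypothetical Series for illustration only.
codes = pd.Series(['a.b', 'axb', np.nan])

# As a regular expression, '.' matches any character.
as_regex = codes.str.contains('a.b', na=False)
print(as_regex.tolist())    # [True, True, False]

# As a literal substring, only 'a.b' itself matches.
as_literal = codes.str.contains('a.b', regex=False, na=False)
print(as_literal.tolist())  # [True, False, False]
```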


### replace(a,b)

Replaces occurrences of a with b in each string.

In [22]:
print ("After replacing @ with $:")
print (s.str.replace('@','$'))

After replacing @ with $:
0             Tom 
1     William Rick
2             John
3          Alber$t
dtype: object


### repeat(value)

Repeats each element the specified number of times.

In [23]:
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

s.str.repeat(2)

0                      Tom Tom 
1     William Rick William Rick
2                      JohnJohn
3                Alber@tAlber@t
dtype: object

### count(pattern)

Returns count of appearance of pattern in each element.

In [24]:
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

print ("The number of 'm's in each string:")
print (s.str.count('m'))

The number of 'm's in each string:
0    1
1    1
2    0
3    0
dtype: int64


### startswith(pattern)

Returns True if the element in the Series/Index starts with the pattern.

In [25]:
print ("Strings that start with 'T':")
s.str.startswith('T')

Strings that start with 'T':


0     True
1    False
2    False
3    False
dtype: bool

### endswith(pattern)

Returns True if the element in the Series/Index ends with the pattern.

In [26]:

print ("Strings that end with 't':")
s.str.endswith('t')

Strings that end with 't':


0    False
1    False
2    False
3     True
dtype: bool

### find(pattern)

Returns the position of the first occurrence of the pattern in each string.

A value of -1 indicates that the pattern does not occur in the element.

In [28]:
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

s.str.find('e')

0   -1
1   -1
2   -1
3    3
dtype: int64

### findall(pattern)

Returns a list of all occurrences of the pattern.

An empty list ([]) indicates that the pattern does not occur in the element.

In [29]:
s = pd.Series(['Tom ', ' William Rick', 'John', 'Alber@t'])

s.str.findall('e')

0     []
1     []
2     []
3    [e]
dtype: object

### swapcase()

Swaps the case of each character (lower to upper and upper to lower).

In [30]:
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
s.str.swapcase()

0             tOM
1    wILLIAM rICK
2            jOHN
3         aLBER@T
dtype: object

### islower()

Checks whether all characters in each string in the Series/Index are in lower case. Returns a Boolean.

In [31]:
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])
s.str.islower()

0    False
1    False
2    False
3    False
dtype: bool

### isupper()

Checks whether all characters in each string in the Series/Index are in upper case. Returns a Boolean.

In [32]:
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

s.str.isupper()

0    False
1    False
2    False
3    False
dtype: bool

### isnumeric()

Checks whether all characters in each string in the Series/Index are numeric. Returns Boolean.

In [33]:
s = pd.Series(['Tom', 'William Rick', 'John', 'Alber@t'])

s.str.isnumeric()

0    False
1    False
2    False
3    False
dtype: bool
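
The sample Series above yields False everywhere for these three checks, so a sketch with a made-up Series `samples` may make the behavior easier to see. It includes values for which each check is True.

```python
import pandas as pd

# Hypothetical Series chosen so that each check passes at least once.
samples = pd.Series(['1234', '12.5', 'tom', 'TOM', 'Tom'])

numeric = samples.str.isnumeric()
lower = samples.str.islower()
upper = samples.str.isupper()

print(numeric.tolist())  # [True, False, False, False, False]
print(lower.tolist())    # [False, False, True, False, False]
print(upper.tolist())    # [False, False, False, True, False]
```

Note that '12.5' is not numeric in this sense, because the '.' character is not a numeric character.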