## 6. Essential Basic Functionality - II

### (a) Function application

To apply your own or another library’s functions to pandas objects, you should be aware of the three methods below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or elementwise.

<pre>
i) Tablewise function application: <b>pipe()</b>
ii) Row or Column Wise function application: <b>apply()</b>
iii) Applying elementwise functions: <b>applymap()</b>
</pre>

In [1]:
import pandas as pd
import numpy as np

#### (i) Tablewise function application:
    
DataFrames and Series can be passed into functions. However, if the function needs to be called in a chain, consider using the pipe() method.<br>

Consider two sample functions: one splits the full name into first and last names, and the other adds a country-of-origin column.

In [2]:
def add_name(df):
    df['first_name'] = df['full_name'].str.split(" ").str.get(0)
    df['last_name'] = df['full_name'].str.split(" ").str.get(1)
    return df

def add_country(df, country=None):
    # country must supply one value per row (no hard-coded row count)
    df['country'] = list(country)
    return df

In [3]:
df = pd.DataFrame()
df['full_name'] = ['Rishav Sharma', 'Python code']
add_country(add_name(df), ['India', 'Finland'])

Unnamed: 0,full_name,first_name,last_name,country
0,Rishav Sharma,Rishav,Sharma,India
1,Python code,Python,code,Finland


<b>The same sequence of calls can be written with pipe() (shown here with different country values):</b>

In [4]:
df.pipe(add_name).pipe(add_country, country = ['Germany', 'Ghana'])

Unnamed: 0,full_name,first_name,last_name,country
0,Rishav Sharma,Rishav,Sharma,Germany
1,Python code,Python,code,Ghana


<u><b>Note</b> </u> : Pandas encourages the second style, which is known as method chaining. pipe() makes it easy to use your own or another library’s functions in method chains, alongside pandas’ methods.
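
One thing to watch for: add_name and add_country above modify the DataFrame they receive, so df itself gains the new columns. A minimal sketch of a side-effect-free variant follows. The helper names split_name and with_country are hypothetical (they are not part of pandas); each step uses assign() so that it returns a new object and can be chained with ordinary pandas methods.

```python
import pandas as pd

def split_name(df):
    # Hypothetical helper: returns a new DataFrame, leaving the input untouched
    parts = df["full_name"].str.split(" ", n=1, expand=True)
    return df.assign(first_name=parts[0], last_name=parts[1])

def with_country(df, country):
    # Hypothetical helper: `country` must have one entry per row
    return df.assign(country=list(country))

people = pd.DataFrame({"full_name": ["Rishav Sharma", "Python code"]})

result = (
    people
    .pipe(split_name)                                  # custom function
    .pipe(with_country, country=["India", "Finland"])  # extra arguments are forwarded
    .sort_values("last_name")                          # regular pandas method in the same chain
)
```

Because nothing is mutated, people still has only the full_name column after the chain runs.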

#### (ii) Row or column-wise function application: apply()

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument:

In [5]:
# No random seed is set, so the values shown below will differ from run to run
array1 = np.array([np.random.randint(1,5,4), np.random.randint(-1,5,4), np.random.randint(-5,+10,4)])
df1 = pd.DataFrame(array1, columns=['col1', 'col2', 'col3', 'col4'])
df1

Unnamed: 0,col1,col2,col3,col4
0,3,2,3,1
1,4,-1,1,-1
2,1,-5,5,9


In [6]:
df1.apply(lambda x: x+1)                 # Adds 1 to every element, one column at a time

Unnamed: 0,col1,col2,col3,col4
0,4,3,4,2
1,5,0,2,0
2,2,-4,6,10


The apply() method will also dispatch on a string method name.

In [7]:
df1.apply("std", axis=1)

0    0.957427
1    2.362908
2    5.972158
dtype: float64

apply() also takes a raw argument, which is False by default. With raw=False, each row or column is converted into a Series before the function is applied. With raw=True, the function receives an ndarray instead, which is usually faster if you do not need the indexing functionality.
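
As a rough illustration of the raw argument, the sketch below applies the same reduction both ways. The frame and the value_range function are made up for this example; the function relies only on NumPy reductions, so it accepts either a Series or an ndarray.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def value_range(values):
    # Uses only NumPy reductions, so it works on a Series or an ndarray
    return np.max(values) - np.min(values)

as_series = frame.apply(value_range, axis=1)              # each row arrives as a Series
as_ndarray = frame.apply(value_range, axis=1, raw=True)   # each row arrives as an ndarray

# Confirm what the function actually receives when raw=True
is_array = frame.apply(lambda values: isinstance(values, np.ndarray), axis=1, raw=True)
```

Both calls give the same result here; the difference is only in the object handed to the function, so raw=True is worth trying when the function never touches labels.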

#### (iii) Applying elementwise functions: applymap()

Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the methods applymap() on DataFrame and analogously map() on Series accept any Python function taking a single value and returning a single value.

In [8]:
def f(x):
    return x-5

In [9]:
df1['col3'].map(f)                # Works with a series

0   -2
1   -4
2    0
Name: col3, dtype: int64

In [10]:
df1.applymap(f)

Unnamed: 0,col1,col2,col3,col4
0,-2,-3,-2,-4
1,-1,-6,-4,-6
2,-4,-10,0,4
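
A version note, to the best of my knowledge: from pandas 2.1 onward, DataFrame.applymap() is deprecated in favor of DataFrame.map(), which does the same elementwise job. A small sketch that runs on either side of that change picks whichever method is available; the frame and the minus_five function here are illustrative only.

```python
import pandas as pd

frame = pd.DataFrame({"col1": [3, 4, 1], "col2": [2, -1, -5]})

def minus_five(x):
    # Takes a single value and returns a single value
    return x - 5

# Prefer DataFrame.map where it exists; fall back to applymap on older pandas
elementwise = frame.map if hasattr(frame, "map") else frame.applymap
shifted = elementwise(minus_five)
```

For simple arithmetic like this, frame - 5 is the vectorized equivalent and is preferable; elementwise application is for functions that cannot operate on whole arrays.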


<hr>

### (b) Aligning objects with each other with align()

The align() method is the fastest way to simultaneously align two objects. It supports a join argument (related to joining and merging):

<ul>

<li>join='outer': take the union of the indexes (default)

<li>join='left': use the calling object’s index

<li>join='right': use the passed object’s index

<li>join='inner': intersect the indexes
    
</ul>

<b> Let's look at an example to understand this.</b>

In [11]:
df2 = pd.DataFrame([[1,2,3,4], [6,7,8,9]], columns=['D', 'B', 'E', 'A'], index=[1,2])
df3 = pd.DataFrame([[10,20,30,40], [60,70,80,90], [600,700,800,900]], columns=['A', 'B', 'C', 'D'], index=[2,3,4])
df2

Unnamed: 0,D,B,E,A
1,1,2,3,4
2,6,7,8,9


In [12]:
df3

Unnamed: 0,A,B,C,D
2,10,20,30,40
3,60,70,80,90
4,600,700,800,900


Let's align these two dataframes, aligning by columns (axis=1), and performing an outer join on column labels (join='outer'):

In [13]:
a1, a2 = df2.align(df3, join="outer", axis=1)
a1

Unnamed: 0,A,B,C,D,E
1,4,2,,1,3
2,9,7,,6,8


In [14]:
a2

Unnamed: 0,A,B,C,D,E
2,10,20,30,40,
3,60,70,80,90,
4,600,700,800,900,


##### Points to notice:

<pre>
i) The columns in <b>df2</b> have been rearranged so they align with the columns in <b>df3</b>.

ii) There is a column labelled 'C' that has been added to <b>df2</b>, and a column labelled 'E' that has been added to <b>df3</b>. These columns have been filled with NaN. This is because we performed an <u>outer join</u> on the column labels.

iii) None of the values inside the DataFrames have been altered.

iv) Note that the row labels are not aligned; <b>df3</b> has rows 3 and 4, whereas <b>df2</b> does not. This is because we requested alignment on columns (axis=1).
</pre>

<b>In summary, use DataFrame.align() when you want to make sure the arrangement of rows and/or columns is the same between two dataframes, without altering any of the data contained within the two dataframes.</b>
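
To see how the other join options behave, here is a short sketch using the same data as df2 and df3, under the new names left and right so the originals stay untouched. It shows an inner join on both axes (the default when axis is not given) and a left join on rows only.

```python
import pandas as pd

left = pd.DataFrame([[1, 2, 3, 4], [6, 7, 8, 9]],
                    columns=["D", "B", "E", "A"], index=[1, 2])
right = pd.DataFrame([[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
                     columns=["A", "B", "C", "D"], index=[2, 3, 4])

# No axis given: rows and columns are aligned together.
# join="inner" keeps only the labels present in both objects.
inner_left, inner_right = left.align(right, join="inner")

# join="left" on axis=0 keeps the caller's row labels; columns are left as they were.
rows_left, rows_right = left.align(right, join="left", axis=0)
```

After the inner join, both results contain only row 2 and the shared columns A, B and D. After the left join on rows, rows_right has rows 1 and 2, with row 1 filled with NaN because right has no such label.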

<a href="https://stackoverflow.com/questions/51645195/pandas-align-function-illustrative-example/51645550"> align() illustrative example </a>

<hr>

### (c) Dropping labels from an axis

The drop() method removes a set of labels from an axis: pass columns=... to remove columns or index=... to remove rows.

In [15]:
df1

Unnamed: 0,col1,col2,col3,col4
0,3,2,3,1
1,4,-1,1,-1
2,1,-5,5,9


In [16]:
df1.drop(columns=['col1', 'col3'], inplace=True)            # columns= already selects the axis; inplace=True updates df1 itself
df1

Unnamed: 0,col2,col4
0,2,1
1,-1,-1
2,-5,9


In [17]:
df1.drop(index=[1, 2], inplace=True)                             # removing rows; index= already selects the axis
df1

Unnamed: 0,col2,col4
0,2,1
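
The cells above use inplace=True, which changes df1 permanently. A sketch of the non-mutating form follows, using a made-up frame: drop() returns a new object by default, and errors="ignore" lets it skip labels that do not exist.

```python
import pandas as pd

frame = pd.DataFrame({"col1": [3, 4, 1], "col2": [2, -1, -5], "col3": [3, 1, 5]})

# Without inplace=True, drop() returns a new DataFrame and leaves `frame` unchanged
trimmed = frame.drop(columns=["col1"]).drop(index=[1, 2])

# errors="ignore" skips labels that are not present instead of raising KeyError
tolerant = frame.drop(columns=["col1", "missing"], errors="ignore")
```

Returning a new object also makes drop() easy to use inside a method chain.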


<hr>

### (d) Iteration

The behavior of basic iteration over pandas objects depends on the type. When iterating over a Series, it is regarded as array-like, and basic iteration produces the values. DataFrames follow the dict-like convention of iterating over the “keys” of the objects.

In [18]:
df3

Unnamed: 0,A,B,C,D
2,10,20,30,40
3,60,70,80,90
4,600,700,800,900


In [19]:
for cols in df3:
    print(cols)

A
B
C
D


In [20]:
for index, row in df3.iterrows():
    print(index)

2
3
4


In [21]:
for label,series in df3.items():
    print(series)

2     10
3     60
4    600
Name: A, dtype: int64
2     20
3     70
4    700
Name: B, dtype: int64
2     30
3     80
4    800
Name: C, dtype: int64
2     40
3     90
4    900
Name: D, dtype: int64
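
Besides iterrows() and items(), pandas offers itertuples(), which yields one namedtuple per row and is generally quicker than iterrows() because it does not build a Series for each row. A minimal sketch on a small made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({"A": [10, 60, 600], "B": [20, 70, 700]}, index=[2, 3, 4])

row_sums = {}
for row in frame.itertuples():
    # row.Index holds the row label; the other fields are named after the columns
    row_sums[row.Index] = row.A + row.B
```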


<div class="alert alert-danger alertdanger">
<h3> Warning: </h3>

Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches:
<ol>
        <li>Look for a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing, …
        <li>When you have a function that cannot work on the full DataFrame/Series at once, it is better to use apply() instead of iterating over the values.
</ol>

</div>
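
The two suggestions in the warning can be compared side by side. This sketch uses a small made-up frame and computes the same row total three ways; the "big"/"small" rule in the last step is a hypothetical example of logic applied per row with apply().

```python
import pandas as pd

frame = pd.DataFrame({"A": [10, 60, 600], "B": [20, 70, 700]}, index=[2, 3, 4])

# Row-by-row loop: works, but slow on large frames
looped = pd.Series({index: row["A"] + row["B"] for index, row in frame.iterrows()})

# Vectorized: a single operation over whole columns
vectorized = frame["A"] + frame["B"]

# apply() for per-row logic; here a hypothetical labelling rule
labels = frame.apply(lambda row: "big" if row["A"] + row["B"] > 100 else "small", axis=1)
```

The loop and the vectorized expression produce the same values; on real data the vectorized form is normally the one to reach for first.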

<hr>

### (e) Vectorized string methods (Working with text data)

<a href="https://pandas.pydata.org/docs/user_guide/text.html#text-string-methods">Complete guide: Working with text data (pandas user guide)</a> 

<a href="https://www.youtube.com/watch?v=bofaC0IckHo&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=13">Strings in Pandas (Tutorial) </a>

In [22]:
df

Unnamed: 0,full_name,first_name,last_name,country
0,Rishav Sharma,Rishav,Sharma,Germany
1,Python code,Python,code,Ghana


In [23]:
df.full_name = df.full_name.str.upper()
df

Unnamed: 0,full_name,first_name,last_name,country
0,RISHAV SHARMA,Rishav,Sharma,Germany
1,PYTHON CODE,Python,code,Ghana
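
str.upper() is only one of many methods available through the .str accessor. The sketch below shows a few others on a made-up Series that includes a missing value, since the string methods pass missing entries through rather than raising.

```python
import pandas as pd

names = pd.Series(["Rishav Sharma", "Python code", None])

lengths = names.str.len()                                        # missing value stays missing
has_python = names.str.contains("python", case=False, na=False)  # na=False treats missing as no match
parts = names.str.split(" ", expand=True)                        # one column per piece
initials = names.str.split(" ").str.get(0).str[0]                # first letter of the first word
```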


<hr>

### (f) Sorting

<b>Pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination of both.</b>

#### (i) By index

In [24]:
df4 = pd.DataFrame([[1, -1, 5], [21, -8, 11], [44, 57, 1], [11, 0.5, 0.9]], index=[7, 5, 8, 3])
df4

Unnamed: 0,0,1,2
7,1,-1.0,5.0
5,21,-8.0,11.0
8,44,57.0,1.0
3,11,0.5,0.9


In [25]:
df4.sort_index(ascending=False)             # Sorts by index labels, in descending order

Unnamed: 0,0,1,2
8,44,57.0,1.0
7,1,-1.0,5.0
5,21,-8.0,11.0
3,11,0.5,0.9


#### (ii) By values

In [26]:
df4

Unnamed: 0,0,1,2
7,1,-1.0,5.0
5,21,-8.0,11.0
8,44,57.0,1.0
3,11,0.5,0.9


In [27]:
df4.sort_values(by=1)                # Sort the data frame by any specific column

Unnamed: 0,0,1,2
5,21,-8.0,11.0
7,1,-1.0,5.0
3,11,0.5,0.9
8,44,57.0,1.0


In [28]:
df4.sort_values(by=1, key=lambda x: x**2)         # Sort by the magnitude of column 1 (key receives the whole column)

Unnamed: 0,0,1,2
3,11,0.5,0.9
7,1,-1.0,5.0
5,21,-8.0,11.0
8,44,57.0,1.0


#### (iii) By index and values both

<a href="https://betterprogramming.pub/sorting-a-python-pandas-dataframes-by-index-and-value-7306ac754014"> All possible types of sortings in pandas </a>
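
A minimal sketch of sorting by index and values together: when the index has a name, sort_values() accepts that name in by alongside column labels. The frame, the index name group and the column score are made up for this example.

```python
import pandas as pd

frame = pd.DataFrame(
    {"score": [5, 3, 5, 3]},
    index=pd.Index(["b", "a", "a", "b"], name="group"),
)

# "score" is a column and "group" is the index name; both can appear in `by`
ordered = frame.sort_values(by=["score", "group"], ascending=[False, True])
```

Rows are ordered by score from high to low, and ties are broken by the group label in ascending order.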

<hr>

<a href="https://pandas.pydata.org/docs/user_guide/style.html">Hack: Pandas Styling Data Frames.</a>

<a href="https://www.youtube.com/watch?v=wDYDYGyN_cw&list=PL5-da3qGB5ICCsgW1MxlZ0Hq8LL5U3u9y&index=22">How do I make my Pandas DataFrame smaller and faster?</a>