### The pandas .apply() command and its applications ###

The apply() function in pandas applies a function along an axis of a DataFrame, or to each value of a Series.

For a DataFrame, apply() takes a function as an argument and calls it on each row or each column; for a Series, it calls the function on each element. It can be used to transform or aggregate data, calculate new columns based on existing ones, or apply any other custom function to the data.

The function passed to apply() can be a built-in Python function or a custom function that you define. The results are returned as a new DataFrame or Series.
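
As a small sketch of the "new column from existing ones" use case, the snippet below derives a total column row by row. The toy data and the column names price, qty, and total are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration
orders = pd.DataFrame({"price": [2.0, 3.5, 1.0], "qty": [3, 2, 10]})

# axis=1 passes each row to the function as a Series indexed by column name
orders["total"] = orders.apply(lambda row: row["price"] * row["qty"], axis=1)

print(orders)
```

For plain arithmetic like this, orders["price"] * orders["qty"] gives the same column without apply(); the row-wise form becomes useful when the per-row logic is more involved.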

#### Example of .apply() ####

In this example, the sum_column() function is defined to calculate the sum of a column, and the apply() function is used to apply the function to each column of the DataFrame. The axis=0 argument specifies that the function should be applied to each column, rather than each row. The result is a new Series that contains the sum of each column.

In [1]:
import pandas as pd # Import the Pandas library
import numpy as np # Import the NumPy library

In [2]:
data = {'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]} # Create a dictionary of data
df = pd.DataFrame(data) # Create a DataFrame from the dictionary
print(df,'\n') # Print the DataFrame

def sum_column(column): # Define a function that takes a column
    return column.sum() # Return the sum of the column

result = df.apply(sum_column, axis=0) # Apply the function to each column of the DataFrame

print(result) # Print the result

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9 

A     6
B    15
C    24
dtype: int64


The default axis for DataFrame.apply() is axis=0, which means that the function is applied to each column. You can specify axis=1 to apply the function to each row instead. Series.apply() has no axis argument, because it works element by element.

In [3]:
# Along rows example
def sum_row(row): # Define a function to calculate the sum of a row
    return row.sum() # Return the sum of the row

result = df.apply(sum_row, axis=1) # Apply the function to each row of the DataFrame

print(result) # Print the result

0    12
1    15
2    18
dtype: int64
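
For simple reductions such as a sum, the built-in pandas methods give the same result as apply() and are usually faster on large data, because they avoid calling a Python function once per row or column. A minimal sketch, rebuilding the same DataFrame so that it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

col_sums = df.sum(axis=0)  # same values as df.apply(sum_column, axis=0)
row_sums = df.sum(axis=1)  # same values as df.apply(sum_row, axis=1)

# Check that the built-in methods agree with apply()
print(col_sums.equals(df.apply(lambda col: col.sum(), axis=0)))
print(row_sums.equals(df.apply(lambda row: row.sum(), axis=1)))
```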


#### Besides the named functions you are used to writing, Python also has lambda functions ####

#### What are lambda functions in Python? ####

A lambda function is a small anonymous function, useful when you need a function for one particular task and only once. The simplest example, lambda x: x, has three parts:

    The keyword: lambda
    A bound variable: x
    A body: x


In [4]:
# Example 1

z = (lambda x: x + 1)
print(z(2))

3


In [5]:
# Example 2
# A non-empty string is truthy (an empty string is falsy).
# x % 2 is 1 (truthy) for odd x and 0 (falsy) for even x, so the and/or chain returns 'odd' or 'even'.

(lambda x:(x % 2 and 'odd' or 'even'))(3)

'odd'
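
The and/or idiom above works only because 'odd' is a non-empty, truthy string; if the first result could be falsy, the expression would fall through to the or branch. A conditional expression states the same logic without that pitfall. A short sketch (the names parity and label are just for illustration):

```python
# Conditional expression: value_if_true if condition else value_if_false
parity = lambda x: 'odd' if x % 2 else 'even'

print(parity(3))
print(parity(4))

# Pitfall of the and/or idiom: an empty string is falsy, so the "or" branch wins
label = lambda x: (x % 2 and '' or 'fallback')
print(repr(label(3)))
```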

In [6]:
# Let's use a lambda function in .apply() on the DataFrame we created above

result = df.apply(lambda x: x.mean(), axis=0) # Calculate the mean of each column with lambda

print(result)

A    2.0
B    5.0
C    8.0
dtype: float64
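
apply() can also forward extra arguments to the function, which avoids hard-coding constants inside a lambda. A minimal sketch, assuming a hypothetical helper scale_and_shift that is not part of the original notebook:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

def scale_and_shift(column, factor, offset=0):  # hypothetical helper for illustration
    return column * factor + offset

# Positional extras go in args=(...); keyword extras are forwarded to the function
scaled = df.apply(scale_and_shift, args=(10,), offset=1)
print(scaled)
```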


### Using .apply() in the context of text processing ###

In [7]:
docs = ['The sky is blue.','The sun is bright.','The sun in the sky is bright', 'We can see the shining sun, the bright sun.'] # Create a corpus of text as a list

In [8]:
# Make it a dataframe with features (the text) and labels

data = pd.DataFrame(docs, columns=['text']) # Create a DataFrame from the list of documents
data['class'] = [1,1,0,1] # Add a labels column with some arbitrary values
data # Print the DataFrame

                                          text  class
0                             The sky is blue.      1
1                           The sun is bright.      1
2                 The sun in the sky is bright      0
3  We can see the shining sun, the bright sun.      1


In [9]:
# Use some of the nltk processing tools
# (the stopword list must be downloaded once, e.g. with nltk.download('stopwords'))

from nltk.tokenize import word_tokenize # Import the word_tokenize function
from nltk.corpus import stopwords # Import the stopwords corpus
from nltk.stem import PorterStemmer # Import the PorterStemmer class

In [10]:
data['text'] = data.text.str.lower() # Convert all text to lower case
data # Print the DataFrame

                                          text  class
0                             the sky is blue.      1
1                           the sun is bright.      1
2                 the sun in the sky is bright      0
3  we can see the shining sun, the bright sun.      1


In [11]:
stop = set(stopwords.words("english")) # Build the set of English stopwords
stem = PorterStemmer() # Instantiate the stemmer object

In [12]:
# Remove stopwords using .apply() and a lambda function - we use a lambda because this processing is needed only once
# The .split() method has a similar effect to tokenizing a string: it splits a sentence into a list of its words (punctuation stays attached)
# We split the sentence, then look at each word in a list comprehension and keep it only if it is not in our stopwords set
# The ' '.join() call brings the words back into a sentence with a single space between each word

data['text'] = data['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop])) # Remove stopwords

In [13]:
data['text'] # Print the text column

0                       sky blue.
1                     sun bright.
2                  sun sky bright
3    see shining sun, bright sun.
Name: text, dtype: object
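
Note that str.split() leaves punctuation attached, which is why tokens such as 'blue.' and 'sun,' survive in the output above. One possible way to drop punctuation in the same step is to extract alphabetic tokens with a regular expression. This sketch uses a tiny hand-written stopword set (an assumption, standing in for the NLTK list) so that it runs without any corpus download:

```python
import re
import pandas as pd

# Small stand-in for stopwords.words("english"); an assumption for illustration
toy_stop = {"the", "is", "in", "we", "can"}

texts = pd.Series(["the sky is blue.", "we can see the shining sun, the bright sun."])

def remove_stopwords(text):
    tokens = re.findall(r"[a-z]+", text)  # keep runs of lowercase letters only
    return " ".join(tok for tok in tokens if tok not in toy_stop)

cleaned = texts.apply(remove_stopwords)
print(cleaned.tolist())
```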

In [14]:
# Stemming using .apply() and a lambda function - notice we use lambda as we do this processing only once

data['text'] = data['text'].apply(lambda x: ' '.join([stem.stem(word) for word in x.split()]))

In [15]:
data['text'] # Print the text column

0                     sky blue.
1                   sun bright.
2                sun sky bright
3    see shine sun, bright sun.
Name: text, dtype: object

In [16]:
data # Print the DataFrame

                         text  class
0                   sky blue.      1
1                 sun bright.      1
2              sun sky bright      0
3  see shine sun, bright sun.      1


In [17]:
# All of these steps can be put in one defined function and then applied

data1 = pd.DataFrame(docs, columns=['text']) # Create a new DataFrame

def clean_text(text): # A function that does all the preprocessing
    # Create the stemmer and stopwords
    stem = PorterStemmer() # Instantiate the stemmer object
    stop = set(stopwords.words("english")) # Build the set of English stopwords

    words = [] # Collect the processed words
    for word in text.lower().split(): # Lowercase first so that 'The' matches the stopword 'the'
        # Drop stopwords and any token that is not purely alphanumeric (e.g. 'blue.' or 'sun,')
        if word not in stop and word.isalnum():
            words.append(stem.stem(word)) # Stemming (or lemmatizing)

    return ' '.join(words) # Join the words back into one string

# The preprocessing function is used in .apply()
data1['text'] = data1['text'].apply(clean_text) # Apply the function to the text column
data1 # Display the DataFrame

               text
0               sky
1               sun
2    sun sky bright
3  see shine bright


### After processing the text into the required form, we need to transform it ###

In [18]:
# Separate the features (X) and the labels (y) before the train/test split
X = data['text'] # The features: the processed text column
y = data['class'] # The labels vector

In [19]:
from sklearn.model_selection import train_test_split

In [20]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=2) # Split the data; an integer test_size holds out that many samples (here 2)
X_train,X_test # Display the training and testing sets

(3    see shine sun, bright sun.
 1                   sun bright.
 Name: text, dtype: object,
 0         sky blue.
 2    sun sky bright
 Name: text, dtype: object)
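
Because train_test_split shuffles the rows, the split shown above will generally differ from run to run. Passing random_state fixes the shuffle; the seed 42 below is an arbitrary choice. A small sketch on stand-in data that mirrors the processed text column:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

X = pd.Series(["sky blue.", "sun bright.", "sun sky bright", "see shine sun, bright sun."])
y = pd.Series([1, 1, 0, 1])

# test_size=2 -> exactly two test rows; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=2, random_state=42)
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=2, random_state=42)

print(X_train.index.tolist(), X_test.index.tolist())
print(X_train.equals(X_train2))  # the second call reproduces the first split
```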

In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer # Import the TfidfVectorizer

In [22]:
# Create tfidf object

tfidf = TfidfVectorizer() # Instantiate the TfidfVectorizer object
tfidf.fit(X_train) # Fit the TfidfVectorizer object to the training data

In [23]:
# TF-IDF transformation, creating the X_train and X_test matrices (sparse; converted to dense for printing)

X_train_vec = tfidf.transform(X_train) # Transform the training data into a document-term matrix
X_test_vec = tfidf.transform(X_test) # Transform the testing data into a document-term matrix

print(f"X_train matrix:\n{X_train_vec.todense()}\n\n and X_test Matrix:\n{X_test_vec.todense()}")

X_train matrix:
[[0.33425073 0.46977774 0.46977774 0.66850146]
 [0.70710678 0.         0.         0.70710678]]

 and X_test Matrix:
[[0.         0.         0.         0.        ]
 [0.70710678 0.         0.         0.70710678]]
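
To see where these numbers come from, the first training row can be reproduced by hand. With its default settings, TfidfVectorizer computes a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplies it by the raw term count, and then scales each row to unit Euclidean length. The vocabulary order (bright, see, shine, sun) and the counts below are read off the two training documents shown above; this is a sketch of the arithmetic, not the library's implementation:

```python
import math

n_docs = 2  # training documents: "see shine sun, bright sun." and "sun bright."
vocab = ["bright", "see", "shine", "sun"]  # alphabetical, as the vectorizer orders it
doc_freq = {"bright": 2, "see": 1, "shine": 1, "sun": 2}  # documents containing each term
counts = {"bright": 1, "see": 1, "shine": 1, "sun": 2}    # term counts in the first document

# Smoothed idf, then tf * idf, then L2 normalisation of the row
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
raw = [counts[t] * idf[t] for t in vocab]
norm = math.sqrt(sum(v * v for v in raw))
row = [v / norm for v in raw]

print([round(v, 8) for v in row])
```

The all-zero row in the test matrix arises because neither "sky" nor "blue" occurs in the training documents, so the fitted vocabulary has no column for them.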


Now we have the data in a form where we can apply the algorithms we wish to use.