Just a few minutes back, I came across `Philip Vollet's` post on NLP Profiler, which describes it as a
> **Simple NLP library that allows profiling datasets with one or more text columns. When given a dataset and a column name containing text data, NLP Profiler will return either high-level insights or low-level/granular statistical information about the text in that column.**

As per the developer's repo, `NLP Profiler` aims to be the describe() function for text data, and pretty much more! <br>
Given a dataframe and the name of a text column, it returns either high-level insights or low-level/granular statistical information about that column. OK, that's cool!

I have yet to explore this in depth, so this is a very basic notebook that closely follows the approach of the starter notebooks shared on the developer's page!

Do check out the Repo and the starter notebook here - [nlp_profiler](https://github.com/neomatrix369/nlp_profiler) and
[Notebook](https://github.com/neomatrix369/nlp_profiler/blob/master/notebooks/google-colab/nlp_profiler.ipynb)

In [None]:
#Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import string
import time

In [None]:
#To ignore warning messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
#Pulling the dataset
df = pd.read_csv("/kaggle/input/spooky-author-identification/train.zip")

In [None]:
#To display the full text column instead of truncating it (None = no limit)
pd.set_option('display.max_colwidth', None)
#To display a maximum of 100 columns
pd.set_option('display.max_columns',100)

In [None]:
df.head()

In [None]:
df.shape

We have 19579 records in hand. To keep things quick, we will work with just the first 100 records.

In [None]:
#Retaining just the first 100 records (copied, so later column assignments act on an independent dataframe)
df = df[:100].copy()

In [None]:
#string.punctuation is a built-in constant holding the ASCII punctuation characters
print(string.punctuation)

In [None]:
def remove_punctuations(input_col):
    """Remove all punctuation from a text string (applied to each value of the text column)."""
    #Translation table that maps every punctuation character to None
    table = str.maketrans('','',string.punctuation)
    return input_col.translate(table)

In [None]:
#Applying the remove_punctuations function to every row of the text column
df['text'] = df['text'].apply(remove_punctuations)
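
To see what the translation table actually does, here is a small self-contained sketch on a made-up sentence (the sample string is only an illustration, not a row from the dataset):

```python
import string

def remove_punctuations(text):
    """Return text with every character in string.punctuation removed."""
    table = str.maketrans('', '', string.punctuation)
    return text.translate(table)

# Made-up sample sentence, used only to illustrate the effect
sample = "Hello, world! It's a 'spooky' night..."
cleaned = remove_punctuations(sample)
print(cleaned)  # Hello world Its a spooky night
```

Note that apostrophes are stripped too, so "It's" becomes "Its". Punctuation-based features computed afterwards will likely reflect the cleaned text rather than the original.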

#### Pip installing the NLP Profiler directly from the GitHub repo

In [None]:
!pip install git+https://github.com/neomatrix369/nlp_profiler.git@master

In [None]:
#Importing the apply_text_profiling
from nlp_profiler.core import apply_text_profiling

In [None]:
#Applying on the text column of the dataframe
#The official repo describes passing a Pandas dataframe along with the name of its text column
start = time.time()
profiled_df = apply_text_profiling(df,'text')
end = time.time()
#time.time() returns seconds, so the difference is already in seconds
total_time = end - start
print("Time taken (in secs) for apply_text_profiling to run on 100 records: ",total_time)
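
Elapsed time is just the difference between two clock readings. As a rough sketch, the same pattern can be wrapped in a small reusable helper. The `timed` function below is a hypothetical name of my own (not part of NLP Profiler), and it uses `time.perf_counter`, which is generally better suited to measuring short durations than `time.time`:

```python
import time

def timed(func, *args, **kwargs):
    """Hypothetical helper: run func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start  # already in seconds
    return result, elapsed

# Stand-in workload; in the notebook this could be apply_text_profiling
result, seconds = timed(sum, range(1_000_000))
print(f"Took {seconds:.4f} secs ({seconds / 60:.6f} mins)")
```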

In [None]:
profiled_df.head(2)

We can now see all the features the `apply_text_profiling` function has got!

In [None]:
profiled_df.columns

Sentiment polarity, subjectivity and spelling quality look like great features to check on top of our text data.

In [None]:
#Hist plot for the sentiment polarity for the first 100 sentences
profiled_df['sentiment_polarity'].hist()
plt.title("Sentiment Polarity")
plt.show()

OK! So apply_text_profiling says most of the sentences by the spooky authors are pretty positive! BTW, don't forget we are running this on just 100 sentences, not the entire train set! <br>
I feel sentiment polarity will surely help in getting the gist of the underlying data!

In [None]:
#Subjective or Objective sentence
profiled_df['sentiment_subjectivity_summarised'].hist()
plt.title("Sentiment Subjectivity")
plt.show()

Pretty much subjective sentences!

In [None]:
#Histogram on the words_count
profiled_df['words_count'].hist()
plt.title("Word Count Distribution with NLP_Profiler")
plt.show()

In [None]:
#Average stop word count per sentence
profiled_df['stop_words_count'].mean()
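
For intuition on what a stop-word count measures, here is a rough sketch with a tiny hand-picked stop-word set and two made-up sentences. Both are assumptions for illustration only; NLP Profiler relies on its own stop-word list and tokenisation, so its numbers will differ:

```python
# Tiny hand-picked stop-word set, for illustration only
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "it", "was"}

def count_stop_words(text):
    """Count whitespace-separated tokens that appear in STOP_WORDS."""
    return sum(1 for token in text.lower().split() if token in STOP_WORDS)

# Made-up sentences, not taken from the dataset
sentences = ["It was a dark and stormy night", "The raven sat in the tree"]
counts = [count_stop_words(s) for s in sentences]
mean_count = sum(counts) / len(counts)
print(counts, mean_count)  # [4, 3] 3.5
```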

In [None]:
sns.heatmap(profiled_df[['sentiment_polarity_score','sentiment_subjectivity_score']].corr(),annot=True,cmap='Blues')
plt.title("Correlation Between Sentiment Polarity and Sentiment Subjectivity")
plt.xticks(rotation=45)
plt.show()
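
The heatmap shows what `DataFrame.corr()` returns, which by default is the Pearson correlation coefficient. As a minimal sketch of what that number means, here it is computed by hand on a few made-up scores (the values below are invented, not taken from `profiled_df`):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up polarity and subjectivity scores, for illustration only
polarity = [0.1, 0.4, -0.2, 0.6]
subjectivity = [0.3, 0.5, 0.2, 0.9]
r = pearson(polarity, subjectivity)
print(round(r, 3))
```

A value close to 1 means the two scores rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.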