### Prequels/sequels

- [FastChai sessions: Tweet Sentiment Extraction (data-prep)](https://www.kaggle.com/neomatrix369/fastchai-tweet-sentiment-extraction-data-prep/) | [Extended Dataset](https://www.kaggle.com/neomatrix369/tweet-sentiment-extraction-extended)
- **FastChai sessions: Tweet Sentiment Extraction (analysis)**

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import string

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import warnings
from math import pi
import seaborn as sns
import plotly.express as px
import matplotlib
import matplotlib.pyplot as plt
from IPython.display import HTML,display

sns.set(style="whitegrid", font_scale=1.75)


# prettify plots
plt.rcParams['figure.figsize'] = [20.0, 5.0]
    
%matplotlib inline

warnings.filterwarnings("ignore")

percentiles = [value/100 for value in range(10, 100, 10)] + [0.05, 0.25, 0.75, 0.95]

## Loading datasets

In [None]:
EXTENDED_DATA_FOLDER='/kaggle/input/tweet-sentiment-extraction-extended'

In [None]:
def reduce_memory_footprint(dataset: pd.DataFrame, columns_to_convert: list = None, drop_na: bool = False) -> pd.DataFrame:
    """Convert object (string) columns to the 'category' dtype to save memory."""
    columns_converted = []
    columns_failed = []

    # None (instead of a mutable [] default) means "convert every column"
    if not columns_to_convert:
        columns_to_convert = dataset.columns

    if drop_na:
        dataset = dataset.dropna()

    for each_col in columns_to_convert:
        if hasattr(dataset[each_col], 'dtype'):
            # Compare dtypes by equality rather than identity
            if dataset[each_col].dtype == np.dtype('O'):
                dataset[each_col] = dataset[each_col].astype('category')
                columns_converted.append(each_col)
        else:
            columns_failed.append(each_col)

    print(f'Columns converted to type category: {columns_converted}')
    print(f'Columns that do NOT have the attribute "dtype": {columns_failed}')
    return dataset
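
As a rough illustration of why string columns are converted to `category`, the sketch below compares the memory used by a low-cardinality column before and after the conversion. The toy column and its size are assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical column: 3 distinct labels repeated many times
labels = pd.Series(['positive', 'neutral', 'negative'] * 1000)

bytes_as_strings = labels.memory_usage(deep=True)
bytes_as_category = labels.astype('category').memory_usage(deep=True)

print(f'strings:  {bytes_as_strings} bytes')
print(f'category: {bytes_as_category} bytes')
```

The saving comes from storing each distinct value once plus a small integer code per row, so it is most noticeable for columns with few distinct values (such as `sentiment`). Columns where nearly every value is unique (such as `text`) may not shrink at all.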

## Original datasets

In [None]:
training_dataset = reduce_memory_footprint(pd.read_csv(f'{EXTENDED_DATA_FOLDER}/train.csv'), drop_na=True)
test_dataset = reduce_memory_footprint(pd.read_csv(f'{EXTENDED_DATA_FOLDER}/test.csv'), drop_na=True)

### Hypothesis/question: can we find 'selected_text' in every 'text' column of the training dataset?

It's possible there is a percentage of rows where they do not match exactly (bad data or another reason)

In [None]:
def jaccard_score(str1: str, str2: str) -> float: 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
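
To make the score concrete, here is a minimal sketch of the same word-level Jaccard calculation on made-up strings. The name `word_jaccard` and the choice to return `0.0` when both strings are empty are assumptions for illustration; the function above would raise a `ZeroDivisionError` in that case.

```python
def word_jaccard(str1: str, str2: str) -> float:
    """Word-level Jaccard similarity; returns 0.0 when neither string has any words."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    union_size = len(a | b)
    if union_size == 0:  # assumed convention to avoid dividing by zero
        return 0.0
    return len(a & b) / union_size

# 2 shared words ("love", "this") out of 4 distinct words
print(word_jaccard('I love this song', 'love this'))
# Identical once lower-cased
print(word_jaccard('Good Morning', 'good morning'))
# Both empty: falls back to the assumed 0.0
print(word_jaccard('', ''))
```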

In [None]:
def is_selected_text_in_text(dataset: pd.DataFrame) -> list:
    results = []
    for text_str, selected_text_str in zip(dataset['text'].values, dataset['selected_text'].values):
        text_str, selected_text_str = str(text_str), str(selected_text_str)
        if selected_text_str in text_str:
            results.append('Yes') 
        else:
            results.append('No')
            
    return results

training_dataset['is_selected_text_in_text'] = is_selected_text_in_text(training_dataset)
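
Note that `selected_text_str in text_str` is a character-level substring test, so a `Yes` does not guarantee that the selection lines up with whole words. A small sketch with made-up strings shows how a substring match can still share no complete word with the full text:

```python
text = 'so happy today'
selected = 'appy tod'  # hypothetical selection that cuts through two words

is_substring = selected in text
shared_words = set(selected.split()) & set(text.split())

print(is_substring)
print(shared_words)
```

A row like this would be counted as `Yes` here but would score `0` on a word-level Jaccard similarity.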

In [None]:
def plot_chart(dataframe, width=20.0, height=7.0, title='<No title assigned>', 
               ylabel_title='<No ylabel title assigned>', stacked=False):
    plt.rcParams['figure.figsize'] = [width, height]
    font = {'size': 16}
    matplotlib.rc('font', **font)
    ax = dataframe.plot(kind='barh', stacked=stacked, title=title)
    ax.set_ylabel(ylabel_title)

In [None]:
def plot_sns_chart(dataframe, y_axis, class_separator=None, width=20.0, height=9.0, 
                   font_scale=2, title = 'Title has not been set',
                   xlabel_title="Xlabel title not been set", ylabel_title="Ylabel title not been set"):
    plt.rcParams['figure.figsize'] = [width, height]
    sns.set(font_scale=font_scale)
    g = sns.countplot(y=y_axis, hue=class_separator, data=dataframe)
    g.set(xlabel=xlabel_title, ylabel=ylabel_title)
    g.set_title(title)
    total = float(len(dataframe))
    margin = 0.0025
    for patch in g.patches:
        percentage = ' {:.1f}%'.format(100 * patch.get_width() / total)
        x = patch.get_width() + (patch.get_width() * margin)
        y = patch.get_y() + (patch.get_height() / 2) 
        g.annotate(percentage, (x, y), va='center') 

In [None]:
plot_sns_chart(training_dataset, 'is_selected_text_in_text', width=35.0, height=2.0, 
               title=f'Is the text in selected_text present in the text column.',
               ylabel_title='Is selected_text present in text', 
               xlabel_title=f'Number of rows. Total rows {training_dataset.shape[0]}')

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Conclusion: all the values in the `selected_text` column can be found in the `text` column.</p></i>

We can further use the `jaccard_score` function to measure the degree of similarity between the two columns, even though we know that only a subset of the text in the `text` column will be present in the `selected_text` column.

In [None]:
def apply_jaccard_similarity(selected_text_col, text_col) -> list:
    selected_texts = selected_text_col.values
    texts = text_col.values
    scores = []
    for selected_text, text in zip(selected_texts, texts):
        scores.append(jaccard_score(selected_text, text))
    return scores    

In [None]:
training_dataset['selected_text_and_text_jaccard_similarity'] = apply_jaccard_similarity(training_dataset['selected_text'], training_dataset['text'])

In [None]:
plt.rcParams['figure.figsize'] = [20, 10]
training_dataset['selected_text_and_text_jaccard_similarity'].hist()
plt.xlabel('Jaccard similarity score') 
plt.title('Jaccard similarity between selected_text and text (overall)')

In [None]:
# Set the figure size before creating the subplots so that it takes effect
plt.rcParams['figure.figsize'] = [20, 30]
_, plots = plt.subplots(3, 1)

sentiment = training_dataset['sentiment'] == 'positive'
training_dataset['selected_text_and_text_jaccard_similarity'][sentiment].hist(ax=plots[0])
plots[0].set_title('positive')

sentiment = training_dataset['sentiment'] == 'neutral'
training_dataset['selected_text_and_text_jaccard_similarity'][sentiment].hist(ax=plots[1])
plots[1].set_title('neutral')

sentiment = training_dataset['sentiment'] == 'negative'
training_dataset['selected_text_and_text_jaccard_similarity'][sentiment].hist(ax=plots[2])
plots[2].set_title('negative')

plt.suptitle('Jaccard similarity between selected_text and text (by sentiments)')
plt.xlabel('Jaccard similarity score')
plt.ylabel('Number of rows')
plt.tight_layout()

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Looking at the Jaccard similarity scores for the different sentiments, the `positive` and `negative` sentiments show the same pattern of dissimilarity, while the `neutral` scores are clearly bunched up at the high end of the scale (close to `1`, i.e. a left-skewed distribution), meaning that for the majority of `neutral` rows `selected_text` and `text` are similar.</p></i>
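
To put a number on the shape of such histograms, one option is the skewness statistic: it is negative when the bulk of the values sits at the high end with a tail towards low values, and positive in the opposite case. The sketch below uses made-up scores (not the dataset) and a hypothetical helper `skewness`.

```python
from statistics import mean

def skewness(values):
    """Population skewness: third central moment divided by the cubed standard deviation."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical scores: mostly 1.0 with a few low values
mostly_high = [1.0] * 8 + [0.5, 0.2]
# Hypothetical scores: mostly low with a few high values
mostly_low = [0.1] * 6 + [0.2, 0.3, 0.6, 1.0]

print(skewness(mostly_high))  # negative: tail towards low scores
print(skewness(mostly_low))   # positive: tail towards high scores
```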

In [None]:
jaccard_similarity_by_sentiment_pivot_table = training_dataset.pivot_table('selected_text_and_text_jaccard_similarity', 'sentiment')
jaccard_similarity_by_sentiment_pivot_table = jaccard_similarity_by_sentiment_pivot_table.T.reset_index(drop=True)

In [None]:
plt.rcParams['figure.figsize'] = [20, 5]
g = jaccard_similarity_by_sentiment_pivot_table.plot(kind='barh')
sns.set(font_scale=2)
plt.xlabel('Jaccard similarity') 
plt.ylabel('Sentiments') 

plt.title('Jaccard similarity between selected_text and text (by sentiments)')
margin = 0.00001
for patch, column in zip(g.patches, jaccard_similarity_by_sentiment_pivot_table.columns):
    # Mean Jaccard similarity for this sentiment (the pivot table has a single row)
    value = jaccard_similarity_by_sentiment_pivot_table[column].iloc[0]
    percentage = ' {:.1f}%'.format(value * 100)
    x = patch.get_width() + (patch.get_width() * margin)
    y = patch.get_y() + (patch.get_height() / 2) 
    g.annotate(percentage, (x, y), va='center') 
plt.legend(loc='center')

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">We can see that for `neutral` sentiment texts we have almost `100%` (i.e. `97%`) similarity between the `selected_text` and `text` columns, meaning that for most of these rows `selected_text` and `text` are the same. But for the `positive` and `negative` sentiments the degree of similarity is between `31%` and `34%`, meaning that only a subset of the text from the `text` column is present in `selected_text`.</p></i>
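
The per-sentiment figures above are means: `pivot_table` aggregates with the mean unless another `aggfunc` is given. A minimal sketch on made-up scores (the column name `jaccard` and the values are assumptions) shows that it matches a plain `groupby(...).mean()`:

```python
import pandas as pd

# Hypothetical scores, not taken from the dataset
scores = pd.DataFrame({
    'sentiment': ['neutral', 'neutral', 'positive', 'positive', 'negative'],
    'jaccard': [1.0, 0.9, 0.4, 0.2, 0.3],
})

by_pivot = scores.pivot_table('jaccard', 'sentiment')  # aggfunc defaults to the mean
by_groupby = scores.groupby('sentiment')['jaccard'].mean()

print(by_pivot)
print(by_groupby)
```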

## Training dataset: text column

In [None]:
selected_repeat_columns = [
    'repeated_letters_count', 'repeated_digits_count', 'repeated_spaces_count', 
    'repeated_whitespaces_count', 'repeated_punctuations_count', 'english_characters_count', 
    'non_english_characters_count'
]

In [None]:
profiled_train_text_sentiment = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_train_text_sentiment.csv')
profiled_train_text_ease_of_reading = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_train_text_ease_of_reading.csv')
profiled_train_text_grammar = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_train_text_grammar_check.csv')
profiled_train_text_spelling = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_train_text_spelling_check.csv')
profiled_train_text_granular = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_train_text_granular_features.csv')

In [None]:
profiled_train_text = pd.concat([training_dataset[['text', 'sentiment']], profiled_train_text_sentiment, profiled_train_text_ease_of_reading, 
                                 profiled_train_text_grammar, profiled_train_text_spelling, profiled_train_text_granular], axis=1)
profiled_train_text = reduce_memory_footprint(profiled_train_text)
profiled_train_text['sentiment_polarity_summarised'] = profiled_train_text['sentiment_polarity_summarised'].apply(lambda x: x.lower())

In [None]:
profiled_train_text

### Hypothesis/question: is the sentiment of the 'text' column correct?

It's possible there is a percentage of rows where the sentiment may be incorrect or different from what we would expect (incorrect data or another reason)

In [None]:
def check_if_sentiment_of_text_matches(dataset: pd.DataFrame) -> list:
    result = []
    for sentiment_str, sentiment_polarity_summarised_str in zip(dataset.sentiment.values, dataset.sentiment_polarity_summarised.values):
        sentiment_str, sentiment_polarity_summarised_str = str(sentiment_str).strip(), str(sentiment_polarity_summarised_str).strip()
        if sentiment_str == sentiment_polarity_summarised_str:
            result.append('Yes')
        else:
            result.append('No')
    return result


In [None]:
profiled_train_text['do_sentiments_match'] = check_if_sentiment_of_text_matches(profiled_train_text)
profiled_train_text['do_sentiments_match'] = profiled_train_text['do_sentiments_match'].astype('category')

In [None]:
plot_sns_chart(profiled_train_text, 'do_sentiments_match', width=35.0, height=3.0, 
               title=f'Do the sentiments for "text" provided match-up.',
               ylabel_title='Sentiments match', 
               xlabel_title=f'Number of rows. Total rows {profiled_train_text.shape[0]}')

In [None]:
plot_sns_chart(profiled_train_text, 'sentiment', class_separator='do_sentiments_match', width=35.0, height=8.0, 
               title=f'Do the sentiments for "text" provided match-up (grouped by Did sentiments match).',
               ylabel_title='Sentiment',
               xlabel_title=f'Number of rows. Total rows {profiled_train_text.shape[0]}')

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Conclusion: the sentiment does not match for a sizeable number of rows in the `text` column. The matching and mismatching ratios are `58.7%` and `41.3%` respectively. If we look at the `41.3%` of rows where the sentiments do not match, they are split across the three sentiments as follows: `positive` (`6.3%`), `neutral` (`19.3%`) and `negative` (`15.5%`), so most of the mismatches are in the `neutral` and `negative` rows. The mismatching rows could be earmarked for verification during model analysis; whether they affect our machine learning or model building remains to be seen.</p></i>
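
The per-sentiment percentages are shares of all rows (the chart helper divides each bar by `len(dataframe)`), which is why they add up to the overall mismatch ratio rather than to `100%`. A minimal sketch with made-up rows:

```python
from collections import Counter

# Hypothetical (sentiment, do_sentiments_match) pairs, not taken from the dataset
rows = [
    ('positive', 'Yes'), ('positive', 'Yes'), ('positive', 'No'),
    ('neutral', 'Yes'), ('neutral', 'No'), ('neutral', 'No'),
    ('negative', 'Yes'), ('negative', 'No'),
]

total = len(rows)
share_of_all_rows = {pair: 100 * count / total for pair, count in Counter(rows).items()}

# Adding up the 'No' shares per sentiment gives the overall mismatch share
mismatch_total = sum(share for (_, match), share in share_of_all_rows.items() if match == 'No')

for (sentiment, match), share in sorted(share_of_all_rows.items()):
    print(f'{sentiment:<8} {match:<3} {share:.1f}%')
print(f'all mismatches: {mismatch_total:.1f}%')
```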

### Hypothesis/question: does the 'text' or 'selected_text' column have spelling mistakes?

It's possible there is a percentage of rows where the text in the text or selected_text columns have one or more spelling mistakes (incorrect data or another reason)

In [None]:
plot_sns_chart(profiled_train_text, 'spelling_quality_summarised', width=35.0, height=4.0, 
               title=f'Are the spellings correct in "text".',
               ylabel_title='Spelling quality', 
               xlabel_title=f'Number of rows. Total rows {profiled_train_text.shape[0]}')

In [None]:
plot_sns_chart(profiled_train_text, 'sentiment', class_separator='spelling_quality_summarised', width=35.0, height=8.0, 
               title=f'Are the spellings correct in "text" (grouped by Spelling Quality).',
               ylabel_title='Sentiment',
               xlabel_title=f'Number of rows. Total rows {profiled_train_text.shape[0]}')

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Conclusion: `41.2%` of rows have spelling mistakes. If we look at the `41.2%` of the texts where the spelling is `Bad`, they are split across the three sentiments as follows: `positive` (`13.2%`), `neutral` (`17.1%`) and `negative` (`10.9%`). `41.2%` is still a big number; whether it affects our machine learning or model training process remains to be seen.</p></i>

### Hypothesis/question: are the 'text' or 'selected_text' columns easy to read?

It's possible there is a percentage of rows where the text in the `text` or `selected_text` columns are not easy to read for humans or ML models (incorrect data or another reason)

In [None]:
plot_sns_chart(profiled_train_text, 'ease_of_reading_summarised', width=35.0, height=6.0, 
               title=f'Is the text in "text" column easy to read',
               ylabel_title='Ease of reading', 
               xlabel_title=f'Number of rows. Total rows {profiled_train_text.shape[0]}')

In [None]:
plot_sns_chart(profiled_train_text, 'sentiment', class_separator='ease_of_reading_summarised', width=35.0, height=12.0, 
               title=f'Is the text in "text" column easy to read (grouped by Ease of Reading Quality).',
               ylabel_title='Sentiment',
               xlabel_title=f'Number of rows. Total rows {profiled_train_text.shape[0]}')

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Conclusion: it appears that `76.2%` (`65.4%` + `10.8%`) of the `text` values (the same would apply to the `selected_text` column) in the training dataset are readable (human/machine understandable). It would be good to assess the remaining `23.8%` across the different sentiments and tally it with the other findings above. Broken down by sentiment, the texts that are `Standard` or `Easy` to read are: `positive` (`23.8%`), `neutral` (`30.9%`) and `negative` (`22.3%`), while those that are `Confusing` or `Difficult` to read are: `positive` (`5.4%`), `neutral` (`6.3%`) and `negative` (`0.8%`). This can help us determine, during ML training/inference, whether these entries are going to impede the machine learning or model building process.</p></i>

### Hypothesis/question: does the 'text' column contain (a) two or more repeated spaces or whitespaces, (b) two or more repeated letters or digits, and/or (c) two or more repeated punctuation marks?

It's possible there is a percentage of rows where there are repeated spaces, whitespaces, alphanumeric characters and/or punctuation marks.

In [None]:
def plot_repeated_entities_groups(dataset):
    shortened_column_names = ['letters', 'digits', 'spaces', 'whitespaces', 
                              'punctuations', 'english', 'non-english']
    # Set the figure size before creating the subplots so that it takes effect
    plt.rcParams['figure.figsize'] = [20.0, 12.5]
    _, plots = plt.subplots(len(selected_repeat_columns), 1)
    # Keep only counts greater than zero; the rest become NaN and are ignored by hist()
    repeat_counts = dataset[selected_repeat_columns]
    filtered_dataset = repeat_counts[repeat_counts > 0]
    for index, each_column in enumerate(selected_repeat_columns):
        g = filtered_dataset[each_column].hist(ax=plots[index], bins=5)
        g.set_ylabel(shortened_column_names[index])

    plt.xlabel('Number of groups of repeated entities')
    plt.tight_layout()

In [None]:
filter_columns = profiled_train_text[selected_repeat_columns] > 0
profiled_train_text[filter_columns][selected_repeat_columns].describe(percentiles=percentiles)
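
The reason this works as a row counter: indexing a DataFrame with a boolean DataFrame keeps the values where the mask is `True` and replaces the rest with `NaN`, and `describe()` skips `NaN` values. The `count` row therefore gives the number of rows with at least one repeated group per column. A small sketch on made-up counts:

```python
import pandas as pd

# Hypothetical counts for two of the repeat columns
counts = pd.DataFrame({
    'repeated_letters_count': [0, 2, 3, 0],
    'repeated_digits_count': [0, 0, 1, 0],
})

masked = counts[counts > 0]  # zeros become NaN
summary = masked.describe()

print(masked)
print(summary.loc['count'])
```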

In [None]:
plot_repeated_entities_groups(profiled_train_text)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Conclusion: a number of rows have one or more groups of characters (letters, digits, spaces, whitespaces, punctuation marks) that are repeated two or more times in the `text` column of the training dataset. But on the whole these are a smaller number of rows of text compared to the whole corpus in the training dataset. Out of a total of `27,481` rows, we have only `2,157` (`7.84%`), `550` (`2.00%`), `9,653` (`35.13%`), `9,653` (`35.13%`) and `8,702` (`31.67%`) rows with three or more repeated letters, two or more repeated digits, two or more repeated spaces, two or more repeated whitespaces and two or more repeated punctuation marks respectively. The split between **English** and **non-English** characters across the texts is `27,480` (`99.38%`) and `172` (`0.62%`), so we don't need to worry about non-English characters at the moment. Looking at the big picture, whether these rows affect our machine learning or model building remains to be seen.</p></i>
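
The `repeated_*_count` columns come from the pre-computed extended dataset, so how they were derived is not shown in this notebook. As a rough sketch of what counting "groups of repeated characters" could look like, the regular expressions below are assumptions that follow the thresholds mentioned in the conclusion; they are not necessarily what the profiler used, and `count_repeated_groups` is a hypothetical helper.

```python
import re
import string

# Hypothetical patterns, one per kind of repeated entity
PATTERNS = {
    'letters': re.compile(r'([A-Za-z])\1{2,}'),  # same letter 3+ times in a row
    'digits': re.compile(r'(\d)\1+'),            # same digit 2+ times in a row
    'spaces': re.compile(r' {2,}'),              # 2+ consecutive spaces
    'punctuations': re.compile('([' + re.escape(string.punctuation) + r'])\1+'),
}

def count_repeated_groups(text: str) -> dict:
    """Count the non-overlapping repeated groups of each kind in the text."""
    return {name: sum(1 for _ in pattern.finditer(text))
            for name, pattern in PATTERNS.items()}

print(count_repeated_groups('sooooo goood!!  see you in 2011'))
```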

## Training dataset: selected_text column

In [None]:
profiled_train_selected_text_sentiment = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_train_selected_text_sentiment.csv')
profiled_train_selected_text_ease_of_reading = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_train_selected_text_ease_of_reading.csv')
profiled_train_selected_text_grammar = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_train_selected_text_grammar_check.csv')
profiled_train_selected_text_spelling = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_train_selected_text_spelling_check.csv')
profiled_train_selected_text_granular = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_train_selected_text_granular_features.csv')

In [None]:
profiled_train_selected_text = pd.concat([training_dataset['selected_text'], profiled_train_selected_text_sentiment,
                                          profiled_train_selected_text_ease_of_reading, profiled_train_selected_text_grammar, 
                                          profiled_train_selected_text_spelling, profiled_train_selected_text_granular], axis=1)
profiled_train_selected_text = reduce_memory_footprint(profiled_train_selected_text)

In [None]:
profiled_train_selected_text

### Hypothesis/question: does the 'selected_text' column contain (a) two or more repeated spaces or whitespaces, (b) two or more repeated letters or digits, and/or (c) two or more repeated punctuation marks?

It's possible there is a percentage of rows where there are repeated spaces, whitespaces, alphanumeric characters and/or punctuation marks.

In [None]:
filter_columns = profiled_train_selected_text[selected_repeat_columns] > 0
profiled_train_selected_text[selected_repeat_columns][filter_columns].describe(percentiles=percentiles)

In [None]:
plot_repeated_entities_groups(profiled_train_selected_text)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Conclusion: a number of rows have one or more groups of characters (letters, digits, spaces, whitespaces, punctuation marks) that are repeated two or more times in the `selected_text` column of the training dataset. But on the whole these are a smaller number of rows of text compared to the whole corpus in the training dataset. Out of a total of `27,481` rows, we have only `1,287`, `261`, `4,104`, `4,104` and `4,593` rows with three or more repeated letters, two or more repeated digits, two or more repeated spaces, two or more repeated whitespaces and two or more repeated punctuation marks respectively. The split between **English** and **non-English** characters is `27,480` (`~99%`) and `86` (`<1%`) respectively. We have seen similar numbers previously. Again, looking at the big picture, whether these rows affect our machine learning or model building remains to be seen.</p></i>
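
The conclusion above quotes row counts without percentages. Using only the numbers stated there, the shares of the training rows can be worked out as follows (the dictionary keys are just labels chosen for this sketch):

```python
total_rows = 27_481
rows_with_repeats = {
    'letters (3+)': 1_287,
    'digits (2+)': 261,
    'spaces (2+)': 4_104,
    'whitespaces (2+)': 4_104,
    'punctuations (2+)': 4_593,
}

# Share of all training rows, in percent
shares = {name: 100 * count / total_rows for name, count in rows_with_repeats.items()}

for name, share in shares.items():
    print(f'{name:<18} {share:5.2f}%')
```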

## Test dataset: text column

In [None]:
profiled_test_text_sentiment = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_test_text_sentiment.csv')
profiled_test_text_ease_of_reading = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_test_text_ease_of_reading.csv')
profiled_test_text_grammar = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_test_text_grammar_check.csv')
profiled_test_text_spelling = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_test_text_spelling_check.csv')
profiled_test_text_granular = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/profiled_test_text_granular_features.csv')

In [None]:
profiled_test_text = pd.concat([test_dataset[['text', 'sentiment']], profiled_test_text_sentiment,  
                                profiled_test_text_ease_of_reading, profiled_test_text_grammar, 
                                profiled_test_text_spelling, profiled_test_text_granular], axis=1)
profiled_test_text = reduce_memory_footprint(profiled_test_text)
profiled_test_text['sentiment_polarity_summarised'] = profiled_test_text['sentiment_polarity_summarised'].apply(lambda x: x.lower())

In [None]:
profiled_test_text

### Hypothesis/question: is the sentiment of the 'text' column correct?

It's possible there is a percentage of rows where the sentiment may be incorrect or different from what we would expect (incorrect data or another reason)

In [None]:
profiled_test_text['do_sentiments_match'] = check_if_sentiment_of_text_matches(profiled_test_text)
profiled_test_text['do_sentiments_match'] = profiled_test_text['do_sentiments_match'].astype('category')

In [None]:
plot_sns_chart(profiled_test_text, 'do_sentiments_match', width=35.0, height=5.0, 
               title=f'Do the sentiments for "text" provided match-up.',
               ylabel_title='Sentiments match', 
               xlabel_title=f'Number of rows. Total rows {profiled_test_text.shape[0]}')

In [None]:
plot_sns_chart(profiled_test_text, 'sentiment', class_separator='do_sentiments_match', width=35.0, height=8.0, 
               title=f'Do the sentiments for "text" provided match-up (grouped by Sentiment).',
               ylabel_title='Sentiment',
               xlabel_title=f'Number of rows. Total rows {profiled_test_text.shape[0]}')

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Conclusion: `41.3%` of rows have mismatching sentiments, which could have a negative impact on model inference. The mismatch is spread across all three sentiment types in different ratios. Comparing these observations with those for the training dataset, the proportions look similar. So the question repeats: whether these rows affect our machine learning or model building remains to be seen.</p></i>

### Hypothesis/question: does the 'text' column have spelling mistakes?

It's possible there is a percentage of rows where the text in the text column have one or more spelling mistakes (incorrect data or another reason)

In [None]:
plot_sns_chart(profiled_test_text, 'spelling_quality_summarised', width=35.0, height=5.0, 
               title=f'Spelling check.',
               ylabel_title='Spellings are correct', 
               xlabel_title=f'Number of rows. Total rows {profiled_test_text.shape[0]}')

In [None]:
plot_sns_chart(profiled_test_text, 'sentiment', class_separator='spelling_quality_summarised', width=35.0, height=8.0, 
               title=f'Are the spellings correct in "text" (grouped by Sentiment).',
               ylabel_title='Sentiment',
               xlabel_title=f'Number of rows. Total rows {profiled_test_text.shape[0]}')

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Conclusion: `41.633%` of rows have spelling mistakes, which could have a negative impact on model inference. The issue is spread across all the sentiment types, similar to what we observed in the training dataset.</p></i>

### Hypothesis/question: is the 'text' column easy to read?

It's possible there is a percentage of rows where the text in the text column is not easy to read for humans or ML models (incorrect data or another reason)

In [None]:
profiled_test_text

In [None]:
plot_sns_chart(profiled_test_text, 'ease_of_reading_summarised', width=35.0, height=6.0, 
               title=f'Is the text in "text" column easy to read.',
               ylabel_title='Ease of reading', 
               xlabel_title=f'Number of rows. Total rows {profiled_test_text.shape[0]}')

In [None]:
plot_sns_chart(profiled_test_text, 'sentiment', class_separator='ease_of_reading_summarised', width=35.0, height=12.0, 
               title=f'Is the text in "text" column easy to read (grouped by Ease of Reading Quality).',
               ylabel_title='Sentiment',
               xlabel_title=f'Number of rows. Total rows {profiled_test_text.shape[0]}')

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Conclusion: it appears that `76.6%` (`65.1%` + `11.5%`) of the `text` values in the test dataset are readable (human/machine understandable). Similarly, it would be good to assess the remaining `23.4%` across the different sentiments and tally it with the other findings above. Whether these entries affect ML training/inference or model building remains to be seen.</p></i>

### Hypothesis/question: does the 'text' column contain (a) two or more repeated spaces or whitespaces, (b) two or more repeated letters or digits, and/or (c) two or more repeated punctuation marks?

It's possible there is a percentage of rows where there are repeated spaces, whitespaces, alphanumeric characters and/or punctuation marks.

In [None]:
filter_columns = profiled_test_text[selected_repeat_columns] > 0
profiled_test_text[selected_repeat_columns][filter_columns].describe(percentiles=percentiles)

In [None]:
plot_repeated_entities_groups(profiled_test_text)

<i><p style="font-size:20px; background-color: #FFF1D7; border: 2px solid black; margin: 20px; padding: 20px;">Conclusion: a number of rows have one or more groups of characters (letters, digits, spaces, whitespaces, punctuation marks) that are repeated two or more times in the `text` column of the test dataset. But on the whole these are a much smaller number of rows of text compared to the whole corpus in the test dataset. Out of a total of `3,534` rows, we have only `266`, `69`, `1,231`, `1,231` and `1,115` rows with three or more repeated letters, two or more repeated digits, two or more repeated spaces, two or more repeated whitespaces and two or more repeated punctuation marks respectively. The split between **English** and **non-English** characters is `3,534` (`~99%`) and `18` (`<1%`) respectively, just like our previous observations. Again, looking at the big picture, whether these rows affect our machine learning or model building remains to be seen.</p></i>
