### Prequels/sequels

- **FastChai sessions: Tweet Sentiment Extraction (data-prep) | [Extended Dataset](https://www.kaggle.com/neomatrix369/tweet-sentiment-extraction-extended)**
- [FastChai sessions: Tweet Sentiment Extraction (analysis)](https://www.kaggle.com/neomatrix369/fastchai-tweet-sentiment-extraction-analysis/)


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import gc

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# !pip install -U nlp_profiler

In [None]:
# from nlp_profiler.core import apply_text_profiling
from nlp_profiler_class import NLPProfiler

In [None]:
DATASET_UPLOAD_FOLDER='/kaggle/working/upload'
EXTENDED_DATA_FOLDER='/kaggle/input/tweet-sentiment-extraction-extended'

In [None]:
%%bash
DATASET_UPLOAD_FOLDER='/kaggle/working/upload'
mkdir -p ${DATASET_UPLOAD_FOLDER}
cp /kaggle/input/tweet-sentiment-extraction-extended/*.csv ${DATASET_UPLOAD_FOLDER} || true

In [None]:
from typing import Optional

def load_csv_if_exists(filepath: str) -> Optional[pd.DataFrame]:
    # Returns None when the file is missing, so callers can regenerate the data
    result = None
    if os.path.exists(filepath):
        result = pd.read_csv(filepath)
        print(f'Finished loading {result.shape[0]} rows from {filepath}')
    else:
        print(f'Warning: Did NOT find "{filepath}"')
    return result

In [None]:
def save_to_csv_file(dataframe, filename, field_to_drop='text'):
    # Drop the raw text column (if present) so only the profiled features are saved
    dataframe_copy = dataframe.drop(field_to_drop, axis=1, errors='ignore')
    dataframe_copy.to_csv(filename, index=False)

In [None]:
training_dataset = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/train.csv')
test_dataset = pd.read_csv(f'{EXTENDED_DATA_FOLDER}/test.csv')

In [None]:
training_dataset.info()

In [None]:
test_dataset.info()

## Training dataset: text column

In [None]:
profiled_train_text_sentiment = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_train_text_sentiment.csv')
profiled_train_text_ease_of_reading = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_train_text_ease_of_reading.csv')
profiled_train_text_grammar = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_train_text_grammar_check.csv')
profiled_train_text_spelling = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_train_text_spelling_check.csv')
profiled_train_text_granular = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_train_text_granular_features.csv')

In [None]:
%%time
if profiled_train_text_sentiment is None:
    profiled_train_text_sentiment = NLPProfiler().apply_text_profiling(
        training_dataset, 'text', 
        params={'high_level': True, # only sentiment analysis will be returned
                'ease_of_reading_check': False,
                'spelling_check': False,
                'grammar_check': False,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
if profiled_train_text_ease_of_reading is None:
    profiled_train_text_ease_of_reading = NLPProfiler().apply_text_profiling(
        training_dataset, 'text', 
        params={'high_level': False, # only ease-of-reading checks will be returned
                'ease_of_reading_check': True,
                'spelling_check': False,
                'grammar_check': False,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
if profiled_train_text_grammar is None:
    profiled_train_text_grammar = NLPProfiler().apply_text_profiling(
        training_dataset, 'text', 
        params={'high_level': False,
                'ease_of_reading_check': False,                
                'spelling_check': False,
                'grammar_check': True,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
if profiled_train_text_spelling is None:
    profiled_train_text_spelling = NLPProfiler().apply_text_profiling(
        training_dataset, 'text', 
        params={'high_level': False,
                'ease_of_reading_check': False,                
                'spelling_check': True,
                'grammar_check': False,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
force_generate = True
if (profiled_train_text_granular is None) or force_generate:
    profiled_train_text_granular = NLPProfiler().apply_text_profiling(
        training_dataset, 'text', 
        params={'high_level': False,
                'ease_of_reading_check': False,                
                'spelling_check': False,
                'grammar_check': False,
                'granular': True,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
profiled_train_text = pd.concat([training_dataset['text'], profiled_train_text_granular, profiled_train_text_sentiment, 
                                 profiled_train_text_ease_of_reading, profiled_train_text_grammar, profiled_train_text_spelling], axis=1)

In [None]:
profiled_train_text

In [None]:
%%time
save_to_csv_file(profiled_train_text_sentiment, f'{DATASET_UPLOAD_FOLDER}/profiled_train_text_sentiment.csv')
save_to_csv_file(profiled_train_text_ease_of_reading, f'{DATASET_UPLOAD_FOLDER}/profiled_train_text_ease_of_reading.csv')
save_to_csv_file(profiled_train_text_grammar, f'{DATASET_UPLOAD_FOLDER}/profiled_train_text_grammar_check.csv')
save_to_csv_file(profiled_train_text_spelling, f'{DATASET_UPLOAD_FOLDER}/profiled_train_text_spelling_check.csv')
save_to_csv_file(profiled_train_text_granular, f'{DATASET_UPLOAD_FOLDER}/profiled_train_text_granular_features.csv')

In [None]:
del profiled_train_text_sentiment, profiled_train_text_ease_of_reading, profiled_train_text_grammar
del profiled_train_text_spelling, profiled_train_text_granular
gc.collect()
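
The five profiling cells above differ only in which check is switched on. A minimal sketch of a helper that builds the `params` dictionary with a single check enabled follows; `single_check_params` and `CHECK_FLAGS` are hypothetical names that are not part of `nlp_profiler`, and the keys are simply the ones used in the cells above.

```python
# Hypothetical helper (not part of nlp_profiler): builds the same params
# dictionaries used in the cells above, with exactly one check enabled.
CHECK_FLAGS = ('high_level', 'ease_of_reading_check', 'spelling_check',
               'grammar_check', 'granular')

def single_check_params(enabled_check, parallelisation_method='using_swifter'):
    if enabled_check not in CHECK_FLAGS:
        raise ValueError(f'Unknown check: {enabled_check!r}')
    # Every flag is False except the requested one
    params = {flag: flag == enabled_check for flag in CHECK_FLAGS}
    params['parallelisation_method'] = parallelisation_method
    return params

print(single_check_params('grammar_check'))
```

The result could then be passed as `params=single_check_params('grammar_check')` to `apply_text_profiling`, assuming the call signature used in the cells above.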

## Training dataset: selected_text column

In [None]:
profiled_train_selected_text_sentiment = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_train_selected_text_sentiment.csv')
profiled_train_selected_text_ease_of_reading = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_train_selected_text_ease_of_reading.csv')
profiled_train_selected_text_grammar = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_train_selected_text_grammar_check.csv')
profiled_train_selected_text_spelling = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_train_selected_text_spelling_check.csv')
profiled_train_selected_text_granular = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_train_selected_text_granular_features.csv')

In [None]:
%%time
if profiled_train_selected_text_sentiment is None:
    profiled_train_selected_text_sentiment = NLPProfiler().apply_text_profiling(
        training_dataset, 'selected_text', 
        params={'high_level': True, # only sentiment analysis will be returned
                'ease_of_reading_check': False,
                'spelling_check': False,
                'grammar_check': False,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
if profiled_train_selected_text_ease_of_reading is None:
    profiled_train_selected_text_ease_of_reading = NLPProfiler().apply_text_profiling(
        training_dataset, 'selected_text', 
        params={'high_level': False, 
                'ease_of_reading_check': True,
                'spelling_check': False,
                'grammar_check': False,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
if profiled_train_selected_text_grammar is None:
    profiled_train_selected_text_grammar = NLPProfiler().apply_text_profiling(
        training_dataset, 'selected_text', 
        params={'high_level': False,
                'ease_of_reading_check': False,                
                'spelling_check': False,
                'grammar_check': True,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
if profiled_train_selected_text_spelling is None:
    profiled_train_selected_text_spelling = NLPProfiler().apply_text_profiling(
        training_dataset, 'selected_text', 
        params={'high_level': False,
                'ease_of_reading_check': False,                
                'spelling_check': True,
                'grammar_check': False,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
force_generate = True
if (profiled_train_selected_text_granular is None) or force_generate:
    profiled_train_selected_text_granular = NLPProfiler().apply_text_profiling(
        training_dataset, 'selected_text', 
        params={'high_level': False,
                'ease_of_reading_check': False,                
                'spelling_check': False,
                'grammar_check': False,
                'granular': True,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
profiled_train_selected_text = pd.concat([training_dataset['selected_text'], profiled_train_selected_text_granular,
                                          profiled_train_selected_text_sentiment, profiled_train_selected_text_ease_of_reading, 
                                          profiled_train_selected_text_grammar, profiled_train_selected_text_spelling], axis=1)

In [None]:
profiled_train_selected_text

In [None]:
%%time
save_to_csv_file(profiled_train_selected_text_sentiment, f'{DATASET_UPLOAD_FOLDER}/profiled_train_selected_text_sentiment.csv', 'selected_text')
save_to_csv_file(profiled_train_selected_text_ease_of_reading, f'{DATASET_UPLOAD_FOLDER}/profiled_train_selected_text_ease_of_reading.csv', 'selected_text')
save_to_csv_file(profiled_train_selected_text_grammar, f'{DATASET_UPLOAD_FOLDER}/profiled_train_selected_text_grammar_check.csv', 'selected_text')
save_to_csv_file(profiled_train_selected_text_spelling, f'{DATASET_UPLOAD_FOLDER}/profiled_train_selected_text_spelling_check.csv', 'selected_text')
save_to_csv_file(profiled_train_selected_text_granular, f'{DATASET_UPLOAD_FOLDER}/profiled_train_selected_text_granular_features.csv', 'selected_text')

In [None]:
del profiled_train_selected_text_sentiment, profiled_train_selected_text_ease_of_reading, profiled_train_selected_text_grammar
del profiled_train_selected_text_spelling, profiled_train_selected_text_granular
gc.collect()

## Test dataset: text column

In [None]:
profiled_test_text_sentiment = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_test_text_sentiment.csv')
profiled_test_text_ease_of_reading = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_test_text_ease_of_reading.csv')
profiled_test_text_grammar = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_test_text_grammar_check.csv')
profiled_test_text_spelling = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_test_text_spelling_check.csv')
profiled_test_text_granular = load_csv_if_exists(f'{EXTENDED_DATA_FOLDER}/profiled_test_text_granular_features.csv')

In [None]:
%%time
if profiled_test_text_sentiment is None:
    profiled_test_text_sentiment = NLPProfiler().apply_text_profiling(
        test_dataset, 'text', 
        params={'high_level': True, # only sentiment analysis will be returned
                'ease_of_reading_check': False,
                'spelling_check': False,
                'grammar_check': False,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
if profiled_test_text_ease_of_reading is None:
    profiled_test_text_ease_of_reading = NLPProfiler().apply_text_profiling(
        test_dataset, 'text', 
        params={'high_level': False,
                'ease_of_reading_check': True,
                'spelling_check': False,
                'grammar_check': False,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
if profiled_test_text_grammar is None:
    profiled_test_text_grammar = NLPProfiler().apply_text_profiling(
        test_dataset, 'text', 
        params={'high_level': False, 
                'ease_of_reading_check': False,                
                'spelling_check': False,
                'grammar_check': True,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
if profiled_test_text_spelling is None:
    profiled_test_text_spelling = NLPProfiler().apply_text_profiling(
        test_dataset, 'text', 
        params={'high_level': False, 
                'ease_of_reading_check': False,                
                'spelling_check': True,
                'grammar_check': False,
                'granular': False,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
%%time
force_generate = True
if (profiled_test_text_granular is None) or force_generate:
    profiled_test_text_granular = NLPProfiler().apply_text_profiling(
        test_dataset, 'text', 
        params={'high_level': False,
                'ease_of_reading_check': False,                
                'spelling_check': False,
                'grammar_check': False,
                'granular': True,                 
                'parallelisation_method': 'using_swifter'}
    )

In [None]:
profiled_test_text = pd.concat([test_dataset['text'], profiled_test_text_granular, profiled_test_text_sentiment, 
                                profiled_test_text_ease_of_reading, profiled_test_text_grammar, profiled_test_text_spelling], axis=1)

In [None]:
profiled_test_text

In [None]:
save_to_csv_file(profiled_test_text_sentiment, f'{DATASET_UPLOAD_FOLDER}/profiled_test_text_sentiment.csv')
save_to_csv_file(profiled_test_text_ease_of_reading, f'{DATASET_UPLOAD_FOLDER}/profiled_test_text_ease_of_reading.csv')
save_to_csv_file(profiled_test_text_grammar, f'{DATASET_UPLOAD_FOLDER}/profiled_test_text_grammar_check.csv')
save_to_csv_file(profiled_test_text_spelling, f'{DATASET_UPLOAD_FOLDER}/profiled_test_text_spelling_check.csv')
save_to_csv_file(profiled_test_text_granular, f'{DATASET_UPLOAD_FOLDER}/profiled_test_text_granular_features.csv')

In [None]:
del profiled_test_text_sentiment, profiled_test_text_ease_of_reading, profiled_test_text_grammar
del profiled_test_text_spelling, profiled_test_text_granular
gc.collect()

## Uploading newly created/updated CSV files to your Kaggle Dataset

Set up your environment with your Kaggle login details (`KAGGLE_KEY` and `KAGGLE_USERNAME`), read here from the notebook's secrets.

In [None]:
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

import os
os.environ['KAGGLE_KEY'] = user_secrets.get_secret("KAGGLE_KEY")
os.environ['KAGGLE_USERNAME'] = user_secrets.get_secret("KAGGLE_USERNAME")
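
Before authenticating, it can help to confirm that both variables were actually set. A minimal sketch follows; `missing_kaggle_credentials` is a hypothetical helper, and it only checks that the two environment variables named above are non-empty, not that the credentials are valid.

```python
import os

REQUIRED_VARS = ('KAGGLE_USERNAME', 'KAGGLE_KEY')

def missing_kaggle_credentials(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]

missing = missing_kaggle_credentials()
if missing:
    print(f'Warning: missing Kaggle credentials: {", ".join(missing)}')
```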

Using the `kaggle` Python client, log in to your account from within the kernel.

In [None]:
import kaggle
kaggle.api.authenticate()

Get the metadata for the dataset you have already created manually. It is best to create the dataset manually and upload the initial CSV file(s) into it, to avoid subsequent issues with updating the dataset (as seen during my own end-to-end cycle).

Before saving the metadata as a JSON file, add or update two keys, `id` and `id_no`, with the respective details as shown below.

In [None]:
OWNER_SLUG='neomatrix369'
DATASET_SLUG='tweet-sentiment-extraction-extended'
dataset_metadata = kaggle.api.metadata_get(OWNER_SLUG, DATASET_SLUG)
dataset_metadata['id'] = dataset_metadata["ownerUser"] + "/" + dataset_metadata['datasetSlug']
dataset_metadata['id_no'] = dataset_metadata['datasetId']
import json
with open(f'{DATASET_UPLOAD_FOLDER}/dataset-metadata.json', 'w') as file:
    json.dump(dataset_metadata, file, indent=4)
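
To make the two added keys concrete, here is a sketch of the same transformation applied to a plain dictionary. The sample values are made up for illustration and `with_upload_ids` is a hypothetical name; only the keys `ownerUser`, `datasetSlug` and `datasetId` come from the cell above.

```python
def with_upload_ids(metadata):
    """Return a copy of the metadata with 'id' and 'id_no' filled in."""
    updated = dict(metadata)
    updated['id'] = f"{metadata['ownerUser']}/{metadata['datasetSlug']}"
    updated['id_no'] = metadata['datasetId']
    return updated

# Made-up sample values, not real Kaggle output
sample = {'ownerUser': 'some-user', 'datasetSlug': 'some-dataset', 'datasetId': 12345}
print(with_upload_ids(sample)['id'])  # some-user/some-dataset
```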

Finally, call the `dataset_create_version()` API and pass it the folder that contains the metadata file along with the `.csv` and `.fth` file(s) that you would like to upload into your existing Dataset (as a new version).

In [None]:
%%time
# !kaggle datasets version -m "Updating datasets" -p /kaggle/working/upload
kaggle.api.dataset_create_version(DATASET_UPLOAD_FOLDER, 'Updating datasets')
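
A quick sanity check, run before the call above, could catch a missing metadata file or an empty upload folder. The following is a minimal sketch using only the standard library; `check_upload_folder` is a hypothetical helper, and the `.csv`/`.fth` extensions are the ones mentioned above.

```python
from pathlib import Path

def check_upload_folder(folder):
    """Return the data files found; raise if the metadata file is missing."""
    folder = Path(folder)
    if not (folder / 'dataset-metadata.json').is_file():
        raise FileNotFoundError(f'No dataset-metadata.json in {folder}')
    # Collect the names of the files that would go into the new version
    return sorted(p.name for p in folder.iterdir()
                  if p.suffix in ('.csv', '.fth'))
```

For example, `check_upload_folder(DATASET_UPLOAD_FOLDER)` would list the data files about to be uploaded, or fail early if the metadata file has not been written yet.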
