# **LAB 1: DATA PREPARATION**

In this notebook, we will explore how to prepare data for downstream NLP tasks.

A non-exhaustive list of NLP and Generative AI tasks that can benefit from this data
preparation:

- Question & Answer
- Text Summarization
- Instruct Tuning
- Human-Bot Conversations
- Continued PreTraining


For each of these tasks, there are subtle differences in how one might want to prepare the data.
Some of the key data preparation steps used here also apply to
adding guardrails to LLMs (e.g., `Profanity Check` and `Toxicity Detection`).

More on guardrails in Lab #5, but for now let's import our required Python
libraries and get started!


### Step 1: Persistent Python environment

[Implementation](https://www.kaggle.com/code/kononenko/pip-install-once)
[pip-install-forever](https://www.kaggle.com/code/samsammurphy/pip-install-forever)

- Adjust the session persistence settings. 
  - Go to Notebook options (right pane) and set PERSISTENCE to Files only. You can also set it to Variables and Files, if you need persistent variables for your own reasons.
  - Install in target directory: `pip install -r requirements.txt --target=/kaggle/working/workshop`
  - Next:
    1. Import sys: `import sys`
    2. Add path: `sys.path.append("/kaggle/working/workshop")`
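
The path step above can be sketched as a small helper that avoids adding the same directory twice when a cell is re-run. The helper name `add_package_dir` is hypothetical, and the directory is assumed to be the same `--target` used in the install command.

```python
import sys
from pathlib import Path


def add_package_dir(target_dir):
    """Append target_dir to sys.path once; return True if the directory exists."""
    path = str(Path(target_dir))
    if path not in sys.path:
        sys.path.append(path)
    return Path(path).is_dir()


# Assumed location, matching the --target directory used with pip above
found = add_package_dir("/kaggle/working/workshop")
print("package directory found:", found)
```

If the helper reports that the directory is missing, the install step has most likely not been run in this session yet.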

In [None]:
!pip install -r https://raw.githubusercontent.com/yasheshshroff/LLMworkshop/main/labs/requirements_lab1.txt

Collecting absl-py==2.0.0
  Using cached absl_py-2.0.0-py3-none-any.whl (130 kB)
Collecting accelerate==0.24.0
  Using cached accelerate-0.24.0-py3-none-any.whl (260 kB)
Collecting aiobotocore==2.7.0
  Using cached aiobotocore-2.7.0-py3-none-any.whl (73 kB)
Collecting aiohttp==3.8.5
  Using cached aiohttp-3.8.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting aioitertools==0.11.0
  Using cached aioitertools-0.11.0-py3-none-any.whl (23 kB)
Collecting alembic==1.12.1
  Using cached alembic-1.12.1-py3-none-any.whl (226 kB)
Collecting alt-profanity-check==1.3.1
  Using cached alt-profanity-check-1.3.1.tar.gz (1.9 MB)
Collecting annoy==1.17.3
  Using cached annoy-1.17.3.tar.gz (647 kB)
Collecting anyio==4.0.0
  Using cached anyio-4.0.0-py3-none-any.whl (83 kB)
Collecting async-timeout==4.0.3
  Using cached async_timeout-4.0.3-py3-none-any.whl (5.7 kB)
Collecting attrs==23.1.0
  Using cached attrs-23.1.0-py3-none-any.whl (61 kB)
Collecting beautifulsoup4==4.12.2

# Import Required Dependencies 

In [4]:
import warnings
warnings.filterwarnings('ignore')

# set flag for training environment
TRAINING = True

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
from torch.utils.data import Dataset, DataLoader
import pandas as pd 
import numpy as np
from profanity_check import predict_prob

import matplotlib

ImportError: cannot import name 'joblib' from 'sklearn.externals' (/opt/conda/lib/python3.10/site-packages/sklearn/externals/__init__.py)

# Dataset

Training dataset: Text content and metadata for LinkedIn posts during 2021. These posts were collected from the
internet and correspond to a wide array of identified "influencers".

Source: https://www.kaggle.com/datasets/shreyasajal/linkedin-influencers-data


# Import Dataset

To the right of this page, go to "Add Data" and search for "LinkedIn Influencers' Data". Hit the "+" icon to add the data to your `/kaggle/input` data directory.

In [None]:
!wget https://raw.githubusercontent.com/yasheshshroff/LLMworkshop/main/labs/data/influencers_data.csv -O influencers_data.csv

In [None]:
# Read the CSV downloaded in the previous cell
df = pd.read_csv("influencers_data.csv")
df.tail()

In [None]:
df.info()

In [None]:
df.shape

There are approximately 70 LinkedIn influencers in this dataset. Let's take a look at
how many posts we have for each influencer.

In [None]:
# count how many records we have for influencers
df.name.value_counts().sort_values().plot(kind='barh', figsize=(8,12))

# Create Small Sample DataFrame

In [None]:
sample_df = df.sample(100)[['name', 'headline', 'content']]

# drop records with missing content
sample_df = sample_df.dropna()

sample_df.head()

# Custom Functions

Set up a custom function to help with the data transformation. 

```
def custom_function(df, text_columns):
    
    # For text cleaning based functions use: 
    for column in text_columns: 
        df[column] = df[column]  # replace the right-hand side with your transformation
    
    #For an example of how to filter the df see number_proportion_filter.py or numeric_filer.py 
    
    return df
```
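
As an illustration of the template above, here is a minimal sketch of one possible custom function. The name `collapse_whitespace` and the choice of transformation are assumptions made for this example, not part of the workshop code.

```python
import pandas as pd


def collapse_whitespace(df, text_columns):
    """Hypothetical example: squeeze runs of whitespace into single spaces."""
    df = df.copy()  # leave the caller's DataFrame untouched
    for column in text_columns:
        df[column] = (
            df[column]
            .str.replace(r"\s+", " ", regex=True)  # any whitespace run -> one space
            .str.strip()                            # drop leading/trailing spaces
        )
    return df


demo = pd.DataFrame({"content": ["  Hello   world \n", "one\ttwo"]})
print(collapse_whitespace(demo, ["content"])["content"].tolist())
```

Working on a copy keeps the function free of side effects, which makes it easier to chain several such steps in a pipeline.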


# 🙈 Profanity Check




In [None]:
from profanity_check import predict_prob

threshold = 0.9

sample_df["profanity"] = predict_prob(sample_df["content"])

sample_df = sample_df[sample_df["profanity"] < threshold]
sample_df = sample_df.reset_index(drop=True)
sample_df.head()


In [None]:
sample_df['profanity'].hist()

# 🧐 Text Quality Check

## Flesch Grade reading level

Depending on the task at hand, it is sometimes advantageous to evaluate text
based on its estimated reading level. One popular way to do this
is with the Flesch-Kincaid grade level. The score maps to a U.S. school grade: a score
of 9.3 means that a ninth grader would be able to read the document.
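
For reference, the Flesch-Kincaid grade is commonly given as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below computes it with a deliberately naive syllable counter. The helper names `count_syllables` and `naive_fk_grade` are hypothetical, and the vowel-group heuristic is an assumption for illustration, so scores from a dedicated library will differ.

```python
import re


def count_syllables(word):
    """Naive heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def naive_fk_grade(text):
    """Approximate Flesch-Kincaid grade level using simple regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


print(naive_fk_grade("The cat sat on the mat."))
print(naive_fk_grade("Organizational transformation necessitates deliberate communication."))
```

Short sentences made of one-syllable words can score below zero, which is one reason to set a lower bound when filtering by grade level.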

The Wikipedia article on Flesch-Kincaid readability tests does a fair job of explaining it.


Using the `textstat` Python library, we can approximate the reading level and
set thresholds for the level we find ideal for our downstream NLP task.

In this case, we will limit the reading level to between `2` and `10`, but many
other ranges are certainly possible.

In [None]:
import textstat 

min_grade_level = 2
max_grade_level = 10

def check_quality(df, text_column):
    df["flesch_grade"] = df[text_column].apply(textstat.flesch_kincaid_grade)
    return df

cleaned = check_quality(sample_df, 'content')

# Filter to only those records with content between 2 and 10 grade level
cleaned = cleaned[(cleaned["flesch_grade"] >= min_grade_level) & (cleaned["flesch_grade"] <= max_grade_level)]
cleaned.head()


In [None]:
sample_df['flesch_grade'].hist()

# Run Preparation Pipeline

In [None]:
# Basic ETL

# Read in some additional utility functions

def remove_trailing_ws(df, text_columns):
    # For text cleaning based functions use: 
    for column in text_columns: 
        df[column] = df[column].str.strip()
    return df

def length_check_func(df, text_column, minLength, maxLength):
    # Keep rows whose text length is strictly between minLength and maxLength
    lengths = df[text_column].str.len()
    return df[(lengths > int(minLength)) & (lengths < int(maxLength))]

def string_replace(df, text_columns, old, new):
    # Replace a literal substring (regex=False) in each text column
    for column in text_columns:
        df[column] = df[column].str.replace(old, new, regex=False)
    return df

# Keep Relevant Columns
cleaned = df[['name', 'headline', 'about', 'content', 'reactions']]

# Dropping missing
cleaned = cleaned.dropna()


# Text Cleaning (Minimal)
cleaned = string_replace(cleaned, ['content'], '…see more', '')
cleaned = remove_trailing_ws(cleaned, ['content'])

max_len = 10000
min_len = 1

cleaned = length_check_func(cleaned, 'content', min_len, max_len)


# Set thresholds

# Profanity Threshold
profanity_threshold = 0.9

# Target Reading Level
min_grade_level = 2
max_grade_level = 10

# Predict Probability of Profane Language
cleaned["profanity"] = predict_prob(cleaned["content"])

# Filter out Profane Language
cleaned = cleaned[cleaned["profanity"] < profanity_threshold]
cleaned = cleaned.reset_index(drop=True)

# Determine Reading Level
cleaned = check_quality(cleaned, 'content')

# Filter to only those records with content between a 2nd- and 10th-grade reading level
cleaned = cleaned[(cleaned["flesch_grade"] >= min_grade_level) & (cleaned["flesch_grade"] <= max_grade_level)]
cleaned.head()


In [None]:
num_dropped = df.shape[0] - cleaned.shape[0]

print(f"Lost a total of {num_dropped} records, or about {num_dropped / df.shape[0]:.0%}")

# One Step Further

If we really want to talk like an influencer, perhaps we should also filter the records
by the number of reactions each post received. Let's take a look.

In [None]:
cleaned.sort_values('reactions')

We can see that the number of reactions these posts from 'influencers' received appears to range
from zero (yikes!) all the way to over 330k. Notably, that single post with ~330k reactions
is simply `Helen is my kinda lady`.

The fact that this post received so much attention given its lack of context is also
a warning that there could be exogenous latent variables, such as current events, pop culture, etc., that could be driving
the number of reactions, not necessarily the content itself.

Let's be just a little more analytical:

In [None]:
# Identify the 90th percentile of reactions over all posts
p90 = np.quantile(cleaned.reactions, 0.9)

print(f'The 90th percentile is {p90} reactions.')


For our purposes, let's focus on the top "performing" content to fine-tune our model on.

In [None]:
cleaned = cleaned[cleaned.reactions > p90]
cleaned.shape

We're now left with 2,127 high-quality data points to experiment with
for fine-tuning.

Finally, let's ask h2oGPT to provide a title for our LinkedIn Influencer content.
This process is called `zero-shot text generation` (more on this later).

In [None]:
sample_df = cleaned.sample(5)

from gradio_client import Client
import ast
from pprint import pprint


HOST_URL = "https://gpt-genai.h2o.ai/"
GPT_KEY = "f74f043e-45fc-4dfe-9c33-55a4720427f6"
    
client = Client(HOST_URL)

from tqdm import tqdm
tqdm.pandas()

def generate_title(content):
    
    #try:
    summarize_prompt = 'You are a helpful, respectful and honest assistant that specializes in generating accurate titles of LinkedIn posts. Provide a title for the following post. The title should be a single sentence, not using bullet points. Only include the title in the response. The LinkedIn post is: ' + content
    kwargs = dict(
        instruction_nochat=summarize_prompt, 
        h2ogpt_key = GPT_KEY)
    
    response = client.predict(str(dict(kwargs)), api_name='/submit_nochat_api')
    reply = ast.literal_eval(response)['response']
    #except:
        #reply = 'NA'
    return reply
        
sample_df['title'] = sample_df.progress_apply(lambda row :   generate_title(row['content']), axis=1)
sample_df.head()
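
The commented-out `try`/`except` in `generate_title` hints that calls to the endpoint can fail. One hedged way to handle that is a small retry wrapper. The name `call_with_retries` and its parameters are hypothetical and are not part of `gradio_client`; the `'NA'` fallback mirrors the commented-out code above.

```python
import time


def call_with_retries(fn, *args, retries=3, delay_seconds=1.0, fallback="NA"):
    """Call fn(*args); on any exception wait and retry, then return fallback."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except Exception:
            # Sleep between attempts, but not after the final failure
            if attempt < retries - 1:
                time.sleep(delay_seconds)
    return fallback


# Possible usage with the function defined above (assumes sample_df exists):
# sample_df['title'] = sample_df['content'].progress_apply(
#     lambda text: call_with_retries(generate_title, text))
```

Returning a fallback value instead of raising keeps a long `progress_apply` run from aborting on a single failed request, at the cost of having to filter out the `'NA'` rows afterwards.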

In [None]:
print('CONTENT:')
sample_df['content'].to_list()

In [None]:
print('TITLE:')
sample_df['title'].to_list()

In [None]:
# Optional alternative for fine tuning - create an instruction.
cleaned['instruction'] = ('Write a LinkedIn post in the style of an influencer who has the title of '
  + cleaned['headline'] + ' and can be described by the following: ' 
  + cleaned['about'])

cleaned.sample(5)

# Output Dataset

Now we're ready to write our dataset out and experiment with fine-tuning
in Lab #2 with H2O LLM Studio.

In [None]:
#######################################################################################
# WARNING! This could take a very long time using the public facing h2oGPT endpoint. ##
#######################################################################################

# Apply to full data set
cleaned['title'] = cleaned.progress_apply(lambda row :   generate_title(row['content']), axis=1)

In [None]:
# Output locally
cleaned.to_csv('influencers_data_prepared.csv', index=False)

---