# DATA AUGMENTATION

----------------------------
#### GIST OF CHANGES DONE IN THIS NOTEBOOK
----------------------------
-   The need for data augmentation arose because the initial web-scraped data volume was lower than expected.

-   Standardized codes and comments are augmented with the approaches below.
    - <b>ADDITIONAL STANDARDIZATION</b> Existing rows were enhanced.
      - All column-level operations are enhanced to include the alias(col) code, since the ML model will be trained to generate column-level code for column-level operations.
      - Removed the first() function from column-level operations, as it is an aggregation function.
      - Enhanced comments to describe the sort process if present in the code.
      - Removed show(), head(), and printSchema(), as these are display functions.
    -   <b>TEXT AUGMENTATION</b>  -- DISCARDED --
      - The SUMY and T5 summarizers did not change the text descriptions enough for the results to count as alternate descriptions. This is probably because the texts are short and these summarizers do not use synonyms when summarizing.
      - Synonym replacement works, but the replacements produced by the basic nltk code tend to be uncommon words that are unlikely to appear in real-world scenarios. The approach has potential and could be fine-tuned to replace only certain keywords in the code descriptions, but in the interest of time it was set aside and can be taken up in future.
    - <b>MANUAL AUGMENTATION</b> New rows were added.
      - Some function options (sort, read, write, etc.) were exploded to cover a few possible arguments in the code. For example, the write mode can be append or ignore, so rows were added for these possibilities.
      - Standardized columns like COL_A, COL_B, MY_ALIAS_A, etc. will be replaced with random values in both code and comments.
      - Standardized literals like LIT_STR_1, LIT_DEC_1, LIT_INT_1 will be replaced with random values in both code and comments.

    -   <b>NOTE</b>
      - This notebook can augment data exponentially. However, the base web-scraped data is quite small, so augmenting some scenarios significantly increases the data size but can also lead to overfitting, because the basic nature of the descriptions does not change much.
      - Initially, augmented data of up to 40K was used for training, but due to compute constraints and Colab limits the training runs were cumbersome to tune and refine.
      - Datasets of 20K and eventually 11K were then tried; the 11K dataset was selected because it was less time-consuming to train and tune.
      - Since all the selected models are ideally trained on a significantly larger dataset, selecting a smaller subset of the training data means accepting the risk of inaccuracies and overfitting.
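
As a rough illustration of the placeholder replacement described under MANUAL AUGMENTATION, the sketch below swaps standardized tokens such as COL_A or LIT_INT_1 for random values in both the code and its comment. The helper name `randomize_placeholders`, the value pools, and the fixed seed are assumptions made for this example; they are not taken from this notebook's implementation.

```python
import random
import re

# Hypothetical value pools, used only for this illustration.
COLUMN_NAMES = ["customer_id", "order_total", "signup_date", "region"]
INT_LITERALS = [3, 7, 42, 100]

# \b keeps COL_A from matching inside a longer token such as MY_COL_A.
PLACEHOLDER = re.compile(r"\b(?:COL_[A-Z]|LIT_INT_\d+)\b")

def randomize_placeholders(code, comment, rng):
    """Replace each distinct placeholder with the same random value in both texts."""
    tokens = sorted(set(PLACEHOLDER.findall(code + " " + comment)))
    columns = [t for t in tokens if t.startswith("COL_")]
    # sample() draws without replacement, so two columns never share a name.
    mapping = dict(zip(columns, rng.sample(COLUMN_NAMES, len(columns))))
    for token in tokens:
        if token.startswith("LIT_INT_"):
            mapping[token] = str(rng.choice(INT_LITERALS))

    def substitute(text):
        return PLACEHOLDER.sub(lambda m: mapping[m.group(0)], text)

    return substitute(code), substitute(comment)

rng = random.Random(0)  # fixed seed so the run is repeatable
new_code, new_comment = randomize_placeholders(
    "MY_DF.select(COL_A, COL_B).filter(COL_A > LIT_INT_1)",
    "Select COL_A and COL_B where COL_A is greater than LIT_INT_1.",
    rng,
)
print(new_code)
print(new_comment)
```

Building the mapping once and applying it to both strings keeps the code and its description in agreement, which matters when the pair is used as a training example.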

----------------------------
<br>
<br>


<br>

# Import/Install Necessary libraries

In [None]:
# IMPORT/INSTALL

!pip install sumy
!pip install googletrans==4.0.0-rc1

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
# IMPORT NECESSARY PACKAGES
import io
import re
import pandas as pd
from tqdm import tqdm
from google.colab import drive, files
from nltk.corpus import stopwords

import warnings
warnings.simplefilter(action='ignore')


In [None]:
# DISPLAY FUNCTIONS
def fn_display_header(msg):
  print('-' * 80)
  print(' ' * 10, msg)
  print('-' * 80)

def fn_display_message(msg):
  print(msg)

<br>

# READ INPUT DATA

In [None]:
# READ WEB SCRAPED DATA INTO DATAFRAME AND DISPLAY DETAILS

df_raw = pd.read_csv('ETL_P3_manual_data_standardization_v1.csv').drop('Unnamed: 0', axis=1)

fn_display_header('Display Source Dataframe Metadata')
df_raw.info()

# Drop exact duplicate rows (rows with NULL values are not filtered here)
fn_display_header('Drop Duplicates code_snippet and code_description')
fn_display_message(f'\n\n --> Display Count Before deleting records: {df_raw.shape[0]}')
df_init = df_raw.drop_duplicates()
fn_display_message(f' --> Display Count After dropping duplicates: {df_init.shape[0]}\n\n')

fn_display_header('Display df_init Dataframe')
#df_raw.style.set_properties(**{'text-align': 'left'})
df_init.head().style.set_properties(**{'text-align': 'left'})


--------------------------------------------------------------------------------
           Display Source Dataframe Metadata
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1215 entries, 0 to 1214
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   code_description  1215 non-null   object
 1   code_snippet      1215 non-null   object
 2   import_line       111 non-null    object
 3   Category          1215 non-null   object
 4   function          1215 non-null   object
 5   origin_str        1215 non-null   object
dtypes: object(6)
memory usage: 57.1+ KB
--------------------------------------------------------------------------------
           Drop Duplicates code_snippet and code_description
--------------------------------------------------------------------------------


 --> Display Count Before deleting records: 1215
 -->

Unnamed: 0,code_description,code_snippet,import_line,Category,function,origin_str
0,Pyspark code to Write a DataFrame into a CSV file.,"df.write.mode(""overwrite"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
1,Pyspark code which Loads a CSV file and returns the result as a DataFrame. This function will go through the input once to determine the input schema ifinferSchema is enabled. To avoid going through the entire data once disableinferSchema option or specify the schema explicitly using schema.,"df.write.mode(""overwrite"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
2,Pyspark code to Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon'.,"df = spark.read.csv(MY_DIR, schema=df.schema, nullValue=""Hyukjin Kwon"")",,Input/Output,pyspark.sql.DataFrameReader.csv,original
3,Pyspark code which Loads a CSV file and returns the result as a DataFrame. This function will go through the input once to determine the input schema ifinferSchema is enabled. To avoid going through the entire data once disableinferSchema option or specify the schema explicitly using schema.,"df = spark.read.csv(MY_DIR, schema=df.schema, nullValue=""Hyukjin Kwon"")",,Input/Output,pyspark.sql.DataFrameReader.csv,original
4,Pyspark code to Write a DataFrame into a JSON file.,"df.write.mode(""overwrite"").format(""json"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.format,original


<br>
<br>
<br>

----
# <b>ADDITIONAL STANDARDIZATION</b>
----


<br>

# Standardize: Standardize Aliases

In [None]:
# ENHANCE COMMENTS TO INCLUDE ALIASES IF APPLICABLE

# function to handle alias comments.
def fn_handle_alias_comments(comment, code, function, category):
  fn_main = function.split('.')[-1].strip()

  if not comment.strip().endswith('.'):
      comment = comment + '.'

  alias_str = ''
  replaced_flag = 'N'
  ignore_replace_categories = ['Aggregate Functions', 'Grouping', 'Input/Output']

  # Remove first() from code in non aggregation operations.
  if category not in ignore_replace_categories:
    if '.first()' in code:
      code = code.strip().replace('.first()', '')

    # Add alias(col) to the code where it is not already present. This is done only for column-level functions/code.
    # Also enhance the comments to include details about the alias column.
    suffix_replace = ''
    for item in ['.first()', '.collect()', '.createOrReplace()', '.show(truncate=False)']:
      if code.strip().endswith(item):
        code = code.strip().replace(item, '')     if 'ALIAS_' not in code else code
        suffix_replace = item

    if code.strip().endswith(')'):
      comment = comment + f"The result is applied for all rows in MY_DF and stored in the column MY_ALIAS_A in the target dataframe."     if 'ALIAS_' not in comment else comment
      code = code.strip()[:-1] + ".alias(MY_ALIAS_A))" + suffix_replace                                                                   if 'ALIAS_' not in code else code
      alias_str = 'MY_ALIAS_A'

  # Enhance the comments to include details about the alias column
  alias_match = re.findall(r'alias\((MY_ALIAS_[A-Z])+\)', code)

  ignore_functions = ['get_json_object', 'reduce']
  if replaced_flag == 'N':
    if len(alias_match) > 0 and fn_main not in ignore_functions:
      alias_str = alias_match[0]
      comment = comment + f"The result is stored in the column {alias_match[0]} in the target dataframe."       if 'ALIAS_' not in comment else comment

    elif 'MY_ALIAS_A, MY_ALIAS_B' in code:
      alias_str = 'MY_ALIAS_A, MY_ALIAS_B'
      comment = comment + f"The result is stored in the columns MY_ALIAS_A and MY_ALIAS_B in the target dataframe."      if 'ALIAS_' not in comment else comment

  result = [comment, code, alias_str]
  return result



df_handle_aliases = df_init.copy()

# Derive code_description, code_snippet and alias_str in a single pass.
# Rows marked 'original' are left unchanged and get a blank alias_str.
def fn_apply_alias_row(row):
  if row['origin_str'] == 'original':
    return [row['code_description'], row['code_snippet'], '']
  return fn_handle_alias_comments(row['code_description'], row['code_snippet'], row['function'], row['Category'])

df_alias_result = df_handle_aliases.apply(fn_apply_alias_row, axis=1, result_type='expand')
df_handle_aliases['code_description'] = df_alias_result[0]
df_handle_aliases['code_snippet']     = df_alias_result[1]
df_handle_aliases['alias_str']        = df_alias_result[2]


# UNCOMMENT FILTERS AS DESIRED TO DEBUG
df_handle_aliases[
    #(df_handle_aliases['code_snippet'].str.contains('alias\(')) &
    (~df_handle_aliases['origin_str'].str.contains('original'))
    #(~df_handle_aliases['alias_str'].str.contains('MY_ALIAS_')) & (df_handle_aliases['code_snippet'] != df_handle_aliases['code_snippet1'] )
    #(~df_handle_aliases['alias_str'].str.contains('MY_ALIAS_')) & (df_handle_aliases['code_description'] != df_handle_aliases['code_description'] )
    & (df_handle_aliases['alias_str'].str.contains('MY_ALIAS_'))
    #(~df_handle_aliases['alias_str'].str.contains('MY_ALIAS_'))
#].count()  # ~ 406 aliases replaced
][['code_description', 'alias_str', 'code_snippet', 'function', 'Category']].style.set_properties(**{'text-align': 'left'})
#][['origin_str', 'code_description', 'code_snippet', 'code_description1', 'code_snippet1', 'Category']].style.set_properties(**{'text-align': 'left'})

df_handle_aliases.head().style.set_properties(**{'text-align': 'left'})

Unnamed: 0,code_description,code_snippet,import_line,Category,function,origin_str,alias_str
0,Pyspark code to Write a DataFrame into a CSV file.,"df.write.mode(""overwrite"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original,
1,Pyspark code which Loads a CSV file and returns the result as a DataFrame. This function will go through the input once to determine the input schema ifinferSchema is enabled. To avoid going through the entire data once disableinferSchema option or specify the schema explicitly using schema.,"df.write.mode(""overwrite"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original,
2,Pyspark code to Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon'.,"df = spark.read.csv(MY_DIR, schema=df.schema, nullValue=""Hyukjin Kwon"")",,Input/Output,pyspark.sql.DataFrameReader.csv,original,
3,Pyspark code which Loads a CSV file and returns the result as a DataFrame. This function will go through the input once to determine the input schema ifinferSchema is enabled. To avoid going through the entire data once disableinferSchema option or specify the schema explicitly using schema.,"df = spark.read.csv(MY_DIR, schema=df.schema, nullValue=""Hyukjin Kwon"")",,Input/Output,pyspark.sql.DataFrameReader.csv,original,
4,Pyspark code to Write a DataFrame into a JSON file.,"df.write.mode(""overwrite"").format(""json"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.format,original,
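
To make the alias step easier to follow, here is a stripped-down sketch of its core string operation on a single snippet. The helper name `add_alias` is hypothetical, and the sketch leaves out the category checks, suffix handling, and comment updates that `fn_handle_alias_comments` performs.

```python
def add_alias(code, alias="MY_ALIAS_A"):
    """Append .alias(...) inside the outer call unless an alias is already present."""
    code = code.strip()
    if "ALIAS_" in code or not code.endswith(")"):
        return code
    # Drop the final ")" of the outer call, then close it again after the alias.
    return code[:-1] + f".alias({alias}))"

print(add_alias("MY_DF.select(upper(COL_A))"))
print(add_alias("MY_DF.select(upper(COL_A).alias(MY_ALIAS_B))"))
```

Because the alias is attached just before the last closing parenthesis, a call with several arguments, such as `MY_DF.select(COL_A, COL_B)`, would have only its last argument aliased.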


<br>

# Standardize: Remove display functions

In [None]:
# REMOVE DISPLAY FUNCTIONS FROM CODE

# Function to remove display functions like show(), head(),  printSchema()
def fn_remove_display_functions(comment, code, function, category):
  fn_main = function.split('.')[-1].strip()

  if not comment.strip().endswith('.'):
      comment = comment + '.'

  display_functions = ['.show()', '.show(truncate=False)', '.head()',  '.printSchema()']

  # Remove the display functions
  for item in display_functions:
    code = code.strip().replace(item, '')

  result = code
  return result


df_handle_display_fn = df_handle_aliases[['code_description', 'code_snippet', 'import_line', 'Category', 'function', 'origin_str']].copy()

# Call function to remove display functions from code
df_handle_display_fn['code_snippet']     = df_handle_aliases.apply(lambda x: fn_remove_display_functions(x['code_description'], x['code_snippet'], x['function'], x['Category']), axis=1)

# UNCOMMENT AS DESIRED TO DEBUG
df_handle_display_fn[
    (df_handle_display_fn['code_snippet'].str.contains(r'show\('))
    #(df_handle_display_fn['code_snippet'].str.contains(r'head\('))
    #(df_handle_display_fn['code_snippet'].str.contains(r'printSchema\('))
    #(df_handle_display_fn['code_snippet'] == df_handle_display_fn['code_snippet1'] )
#].count()  # 54 shows() | 7 head | 16 printSchema
].style.set_properties(**{'text-align': 'left'})

df_handle_display_fn.head().style.set_properties(**{'text-align': 'left'})


Unnamed: 0,code_description,code_snippet,import_line,Category,function,origin_str
0,Pyspark code to Write a DataFrame into a CSV file.,"df.write.mode(""overwrite"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
1,Pyspark code which Loads a CSV file and returns the result as a DataFrame. This function will go through the input once to determine the input schema ifinferSchema is enabled. To avoid going through the entire data once disableinferSchema option or specify the schema explicitly using schema.,"df.write.mode(""overwrite"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
2,Pyspark code to Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon'.,"df = spark.read.csv(MY_DIR, schema=df.schema, nullValue=""Hyukjin Kwon"")",,Input/Output,pyspark.sql.DataFrameReader.csv,original
3,Pyspark code which Loads a CSV file and returns the result as a DataFrame. This function will go through the input once to determine the input schema ifinferSchema is enabled. To avoid going through the entire data once disableinferSchema option or specify the schema explicitly using schema.,"df = spark.read.csv(MY_DIR, schema=df.schema, nullValue=""Hyukjin Kwon"")",,Input/Output,pyspark.sql.DataFrameReader.csv,original
4,Pyspark code to Write a DataFrame into a JSON file.,"df.write.mode(""overwrite"").format(""json"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.format,original
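
The cell above removes display calls by exact string match, so a call such as `.show(5)` is left in place. A possible regex-based variant is sketched below; the name `strip_display_calls` and the pattern are assumptions for illustration and are not what this notebook runs.

```python
import re

# Matches .show(...), .head(...) or .printSchema(...) when the argument list
# contains no nested parentheses.
DISPLAY_CALL = re.compile(r"\.(?:show|head|printSchema)\([^()]*\)")

def strip_display_calls(code):
    """Remove display-only method calls from a PySpark snippet."""
    return DISPLAY_CALL.sub("", code.strip())

print(strip_display_calls("MY_DF.select(COL_A).show(5, truncate=False)"))
print(strip_display_calls("MY_DF.printSchema()"))
```

Restricting the argument list to `[^()]*` keeps the pattern from swallowing the rest of a chained expression, at the cost of skipping display calls whose arguments contain their own parentheses.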


<br>

# Standardize: Sort Function comments

In [None]:
# HANDLE COMMENTS FOR SORT FUNCTIONS IN CODE

# Function to append the sort column and order to the comment when the code sorts on a plain column
def fn_handle_sort(comment, code, function, category):
  fn_main = function.split('.')[-1].strip()
  result = ''

  if not comment.strip().endswith('.'):
      comment = comment + '.'

  sort_match = re.findall(r"\.sort\((.*?)\)", code)
  if len(sort_match) > 0:
    if '(' not in sort_match[0]:
      comment = comment + f"The data is sorted on the column {sort_match[0]} in ascending order."

  result = comment
  return result


df_handle_sort = df_handle_display_fn.copy()

# Call function to append sort details to the comments
df_handle_sort['code_description']     = df_handle_sort.apply(lambda x: fn_handle_sort(x['code_description'], x['code_snippet'], x['function'], x['Category']), axis=1)

# UNCOMMENT AS DESIRED TO DEBUG
df_handle_sort[
    (df_handle_sort['code_snippet'].str.contains(r'\.sort\('))
    #(df_handle_sort['code_description'] != df_handle_sort['code_description1'])
].count()  # 29 sort() replaced
#].style.set_properties(**{'text-align': 'left'})

df_handle_sort.head().style.set_properties(**{'text-align': 'left'})

Unnamed: 0,code_description,code_snippet,import_line,Category,function,origin_str
0,Pyspark code to Write a DataFrame into a CSV file.,"df.write.mode(""overwrite"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
1,Pyspark code which Loads a CSV file and returns the result as a DataFrame. This function will go through the input once to determine the input schema ifinferSchema is enabled. To avoid going through the entire data once disableinferSchema option or specify the schema explicitly using schema.,"df.write.mode(""overwrite"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
2,Pyspark code to Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon'.,"df = spark.read.csv(MY_DIR, schema=df.schema, nullValue=""Hyukjin Kwon"")",,Input/Output,pyspark.sql.DataFrameReader.csv,original
3,Pyspark code which Loads a CSV file and returns the result as a DataFrame. This function will go through the input once to determine the input schema ifinferSchema is enabled. To avoid going through the entire data once disableinferSchema option or specify the schema explicitly using schema.,"df = spark.read.csv(MY_DIR, schema=df.schema, nullValue=""Hyukjin Kwon"")",,Input/Output,pyspark.sql.DataFrameReader.csv,original
4,Pyspark code to Write a DataFrame into a JSON file.,"df.write.mode(""overwrite"").format(""json"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.format,original
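
The sort handling can be read in isolation as the small sketch below, which mirrors the regex used in `fn_handle_sort`. The helper name `describe_sort` is hypothetical, and unlike the notebook function it inserts a space before the appended sentence.

```python
import re

def describe_sort(comment, code):
    """Append a sentence naming the sort column when .sort() takes a plain argument."""
    match = re.search(r"\.sort\((.*?)\)", code)
    # The lazy match stops at the first ")", so a nested call such as
    # .sort(desc("name")) leaves "(" in the captured text and is skipped.
    if match and "(" not in match.group(1):
        comment += f" The data is sorted on the column {match.group(1)} in ascending order."
    return comment

print(describe_sort("Count rows per name.", 'df.groupBy("name").count().sort("name")'))
print(describe_sort("Count rows per name.", 'df.groupBy("name").count().sort(desc("name"))'))
```

A call that passes extra keyword arguments, for example `.sort("name", ascending=False)`, would still be described as ascending by this logic, so such snippets would need separate handling.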


<br>

# Standardize: Comments with MY_DIR references

In [None]:
# HANDLE COMMENTS FOR CODE WITH MY_DIR REFERENCES

# Function to enhance comments for code that reads from or writes to MY_DIR, and to flag rows for deletion
def fn_handle_my_dir(comment, code, function, category):
  fn_main = function.split('.')[-1].strip()
  result = ''
  delete = 'N'

  if not comment.strip().endswith('.'):
      comment = comment + '.'

  # On manual validation, these codes are duplicates but have incorrect code descriptions
  if len(comment.strip().split('.')) > 2 and 'MY_DIR' in code and 'LIT_' not in comment:
    delete = '-Y-'

  if 'MY_DIR' in code:
    suffix = ''
    if '.option("header", True' in code:
      suffix = 'with a header record'
    if 'directory MY_DIR' not in comment and 'write' in code:
      comment = comment + f'The data is written to file or directory MY_DIR {suffix}.'
      comment = comment.replace('Write', 'overwrite').replace('written', 'overwritten')       if 'overwrite' in code else comment
    if 'directory MY_DIR' not in comment and 'read' in code:
      comment = comment + f'The data is read from file or directory MY_DIR.'
  if 'MY_DIR' in comment:
    if 'overwrite' in code and 'overwrit' not in comment:
      comment = comment.replace(' written', ' overwritten')

  code = code.replace('AnalysisException:', '')
  result = [comment, code, delete]
  return result


df_handle_my_dir = df_handle_sort.copy()

# Call function to flag rows for deletion and to enhance the comments and code for MY_DIR references
df_handle_my_dir['delete']              = df_handle_my_dir.apply(lambda x: fn_handle_my_dir(x['code_description'], x['code_snippet'], x['function'], x['Category'])[2], axis=1)
df_handle_my_dir['code_description']   = df_handle_my_dir.apply(lambda x: fn_handle_my_dir(x['code_description'], x['code_snippet'], x['function'], x['Category'])[0], axis=1)
df_handle_my_dir['code_snippet']        = df_handle_my_dir.apply(lambda x: fn_handle_my_dir(x['code_description'], x['code_snippet'], x['function'], x['Category'])[1], axis=1)


# Drop unwanted records -- these codes are duplicates but have incorrect code descriptions
fn_display_message(f' --> Display Count Before dropping duplicates - {df_handle_my_dir.shape[0]}')
df_handle_my_dir = df_handle_my_dir[df_handle_my_dir['delete'] != '-Y-']
fn_display_message(f' --> Display Count After dropping duplicates - {df_handle_my_dir.shape[0]}')

# UNCOMMENT AS DESIRED TO DEBUG
df_handle_my_dir[
    (df_handle_my_dir['code_snippet'].str.contains('MY_DIR'))
    #& (df_handle_my_dir['delete'] == 'Y' )
    #(df_handle_my_dir['code_description'] != df_handle_my_dir['code_description1'])
#].count()  # 77 MY_DIR
#][['code_description', 'code_snippet1', 'code_description1', 'code_snippet', 'delete']].style.set_properties(**{'text-align': 'left'})
].style.set_properties(**{'text-align': 'left'})

df_handle_my_dir.head().style.set_properties(**{'text-align': 'left'})


 --> Display Count Before dropping duplicates - 1195
 --> Display Count After dropping duplicates - 1177


Unnamed: 0,code_description,code_snippet,import_line,Category,function,origin_str,delete
0,Pyspark code to overwrite a DataFrame into a CSV file.The data is overwritten to file or directory MY_DIR .,"df.write.mode(""overwrite"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original,N
2,Pyspark code to Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon'.The data is read from file or directory MY_DIR.,"df = spark.read.csv(MY_DIR, schema=df.schema, nullValue=""Hyukjin Kwon"")",,Input/Output,pyspark.sql.DataFrameReader.csv,original,N
4,Pyspark code to overwrite a DataFrame into a JSON file.The data is overwritten to file or directory MY_DIR .,"df.write.mode(""overwrite"").format(""json"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.format,original,N
5,Pyspark code which Specifies the input data source format.The data is overwritten to file or directory MY_DIR .,"df.write.mode(""overwrite"").format(""json"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.format,original,N
6,Pyspark code to Read the JSON file as a DataFrame.The data is read from file or directory MY_DIR.,df = spark.read.format('json').load(MY_DIR),,Input/Output,pyspark.sql.DataFrameReader.format,original,N


<br>

# CREATE FINAL STANDARDIZED DATASET

In [None]:
# CREATE FINAL STANDARDIZED DATASET

df_final_standardized = df_handle_my_dir[['code_description', 'code_snippet', 'import_line', 'Category', 'function', 'origin_str']].copy()

fn_display_header('Display df_final_standardized COLUMN/COUNT Details')
df_final_standardized.info()

--------------------------------------------------------------------------------
           Display df_final_standardized COLUMN/COUNT Details
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Index: 1177 entries, 0 to 1214
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   code_description  1177 non-null   object
 1   code_snippet      1177 non-null   object
 2   import_line       110 non-null    object
 3   Category          1177 non-null   object
 4   function          1177 non-null   object
 5   origin_str        1177 non-null   object
dtypes: object(6)
memory usage: 64.4+ KB


<br>
<br>
<br>

----
# <b>AUGMENTATION SECTION: BASIC </b>
----


<br>

# Augmentation: display functions

In [None]:
# AUGMENT/HANDLE DISPLAY COMMENTS

display_functions = ['.show()', '.show(truncate=False)', '.head()',  '.printSchema()']

# Create a blank dataframe to start with
df_augment_display_fn = pd.DataFrame(columns=df_final_standardized.columns)

# Defaults
df_name = 'MY_DF'
import_line = ''
category = 'Display Function'
origin_str = 'Augmented Display'
disp_val = ''

display_fn_array = []

display_ctrl_dict = {
    'show'        : f"Pyspark code which displays the top 20 rows in the dataframe {df_name} in tabular format. This is the default option.",
    'head'        : f"Pyspark code which returns the first row of the dataframe {df_name} as a Row object. This is the default option.",
    'printSchema' : f"Pyspark code which displays the schema or structure of the dataframe {df_name}. It includes column names, data types and whether column is nullable."
}

for disp_fn, disp_comment in display_ctrl_dict.items():

  # Augment show() and head()
  if disp_fn in ['show', 'head']:
    display_fn_array.append({
        'code_description': disp_comment,
        'code_snippet'    : f"{df_name}.{disp_fn}()",
        'import_line'     : import_line,
        'Category'        : category,
        'function'        : disp_fn,
        'origin_str'      : origin_str,
      })

    add = 7
    for disp_val in [5, 10, 15, 50, 75, 100, 125, 150, 175, 200, 223, 258, 273, 299, 350, 371, 446, 500, 750, 1000]:
      add = add + 11
      # show() gets an offset so its row counts differ from head(); the description uses the same count as the code
      row_count = disp_val + add if disp_fn == 'show' else disp_val
      display_fn_array.append({
        'code_description': f"Pyspark code which displays the top {row_count} rows in the dataframe {df_name}.",
        'code_snippet'    : f"{df_name}.{disp_fn}({row_count})",
        'import_line'     : import_line,
        'Category'        : category,
        'function'        : disp_fn,
        'origin_str'      : origin_str,
      })

  # Augment printSchema()
  if disp_fn == 'printSchema':
    display_fn_array.append({
        'code_description': disp_comment,
        'code_snippet'    : f"{df_name}.{disp_fn}()",
        'import_line'     : import_line,
        'Category'        : category,
        'function'        : disp_fn,
        'origin_str'      : origin_str,
      })

# Convert the list of dictionaries to a DataFrame
df_augment_display_fn = pd.DataFrame(display_fn_array)

fn_display_header('Display Augmented Dataframe information: df_augment_display_fn')
df_augment_display_fn.info()

fn_display_header('Display Augmented Dataframe rows: df_augment_display_fn')
df_augment_display_fn.head().style.set_properties(**{'text-align': 'left'})


--------------------------------------------------------------------------------
           Display Augmented Dataframe information: df_augment_display_fn
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   code_description  43 non-null     object
 1   code_snippet      43 non-null     object
 2   import_line       43 non-null     object
 3   Category          43 non-null     object
 4   function          43 non-null     object
 5   origin_str        43 non-null     object
dtypes: object(6)
memory usage: 2.1+ KB
--------------------------------------------------------------------------------
           Display Augmented Dataframe rows: df_augment_display_fn
--------------------------------------------------------------------------------


Unnamed: 0,code_description,code_snippet,import_line,Category,function,origin_str
0,Pyspark code which displays the top 20 rows in the dataframe MY_DF in tabular format. This is the default option.,MY_DF.show(),,Display Function,show,Augmented Display
1,Pyspark code which displays the top 23 rows in the dataframe MY_DF.,MY_DF.show(23),,Display Function,show,Augmented Display
2,Pyspark code which displays the top 39 rows in the dataframe MY_DF.,MY_DF.show(39),,Display Function,show,Augmented Display
3,Pyspark code which displays the top 55 rows in the dataframe MY_DF.,MY_DF.show(55),,Display Function,show,Augmented Display
4,Pyspark code which displays the top 101 rows in the dataframe MY_DF.,MY_DF.show(101),,Display Function,show,Augmented Display


<br>

# Augmentation: sort functions

In [None]:
# AUGMENT CODES USING SORT

df_augment_sort_fn = pd.DataFrame(columns=df_final_standardized.columns)

df_iter_sort = df_final_standardized[(df_final_standardized['code_snippet'].str.contains(r'\.sort\(')) & (df_final_standardized['code_description'].str.contains(r'in ascending order\.'))]

fn_display_message(f' --> Display Count of iterations - {df_iter_sort.shape[0]}')

new_rows = [] # Create a list to store new rows as dictionaries
for index, row in df_iter_sort.iterrows():
  for item in [', ascending=True)', ', ascending=False)']:
    code = row['code_snippet'][:-1] + item
    comment = row['code_description'].replace('ascending', 'descending')       if 'False' in item else row['code_description']

    # Append a dictionary representing the new row to the list
    new_rows.append({
      'code_description': comment,
      'code_snippet': code,
      'import_line': row['import_line'],
      'Category': row['Category'],
      'function': row['function'],
      'origin_str': row['origin_str']
    })

# Concatenate the new rows to the DataFrame
df_augment_sort_fn = pd.concat([df_augment_sort_fn, pd.DataFrame(new_rows)], ignore_index=True)

fn_display_header('Display Augmented Dataframe information: df_augment_sort_fn')
df_augment_sort_fn.info()

fn_display_header('Display Augmented Dataframe rows: df_augment_sort_fn')
df_augment_sort_fn.head().style.set_properties(**{'text-align': 'left'})

 --> Display Count of iterations - 29
--------------------------------------------------------------------------------
           Display Augmented Dataframe information: df_augment_sort_fn
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58 entries, 0 to 57
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   code_description  58 non-null     object
 1   code_snippet      58 non-null     object
 2   import_line       0 non-null      object
 3   Category          58 non-null     object
 4   function          58 non-null     object
 5   origin_str        58 non-null     object
dtypes: object(6)
memory usage: 2.8+ KB
--------------------------------------------------------------------------------
           Display Augmented Dataframe rows: df_augment_sort_fn
--------------------------------------------------------------------------------

Unnamed: 0,code_description,code_snippet,import_line,Category,function,origin_str
0,"Pyspark code which Compute aggregates and returns the result as a DataFrame. The available aggregate functions can be built in aggregation functions such as avg max min sum count group aggregate pandas UDFs created with pyspark.The data is sorted on the column ""name"" in ascending order.","df.groupBy(df.name).agg({""*"": ""count""}).sort(""name"", ascending=True)",,Grouping,pyspark.sql.GroupedData.agg,original
1,"Pyspark code which Compute aggregates and returns the result as a DataFrame. The available aggregate functions can be built in aggregation functions such as avg max min sum count group aggregate pandas UDFs created with pyspark.The data is sorted on the column ""name"" in descending order.","df.groupBy(df.name).agg({""*"": ""count""}).sort(""name"", ascending=False)",,Grouping,pyspark.sql.GroupedData.agg,original
2,"Pyspark code which Compute aggregates and returns the result as a DataFrame. The available aggregate functions can be built in aggregation functions such as avg max min sum count group aggregate pandas UDFs created with pyspark.The data is sorted on the column ""name"" in ascending order.","df.groupBy(df.name).agg(sf.min(df.age)).sort(""name"", ascending=True)",,Grouping,pyspark.sql.GroupedData.agg,original
3,"Pyspark code which Compute aggregates and returns the result as a DataFrame. The available aggregate functions can be built in aggregation functions such as avg max min sum count group aggregate pandas UDFs created with pyspark.The data is sorted on the column ""name"" in descending order.","df.groupBy(df.name).agg(sf.min(df.age)).sort(""name"", ascending=False)",,Grouping,pyspark.sql.GroupedData.agg,original
4,"Pyspark code which Compute aggregates and returns the result as a DataFrame. The available aggregate functions can be built in aggregation functions such as avg max min sum count group aggregate pandas UDFs created with pyspark.The data is sorted on the column ""name"" in ascending order.","df.groupBy(df.name).agg(min_udf(df.age)).sort(""name"", ascending=True)",,Grouping,pyspark.sql.GroupedData.agg,original


<br>

# Augmentation: Miscellaneous functions

In [None]:
# MISCELLANEOUS AUGMENTATION: FOR MY_DIR REFERENCES USING VARIOUS APPLICABLE ARGUMENT OPTIONS

df_augment_misc_fn = pd.DataFrame(columns=df_final_standardized.columns)

df_iter_misc = df_final_standardized[
    (df_final_standardized['code_snippet'].str.contains(r'option\("header", True'))
    | ( (df_final_standardized['code_snippet'].str.contains('mode')) & (df_final_standardized['code_snippet'].str.contains('overwrite')) )
    ]

fn_display_message(f' --> Display Count of iterations - {df_iter_misc.shape[0]}')

new_rows = [] # Create a list to store new rows as dictionaries
for index, row in df_iter_misc.iterrows():
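  # NOTE: the values of this outer loop are not used below; it only runs the body twice per row,
  # and the duplicate rows it creates are removed by drop_duplicates() further down.
  for item in [', ascending=True)', ', ascending=False)']: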
    result_list = []

    if '.option("header", True' in row['code_snippet']:
      # pass 1: replace options with blank to mimic default
      code = row['code_snippet'].replace('.option("header", True)', '')
      comment = row['code_description'].replace("with a header record", "with a header record by default").replace("with header columns", "with header columns by default")
      result_list.append([comment, code])

      # pass 2: set the header option to False (any other True value in the snippet is left as-is)
      code = row['code_snippet'].replace('"header", True', '"header", False')
      comment = row['code_description'].replace("with a header record", "without a header record").replace("with header columns", "without header columns")
      result_list.append([comment, code])

    if 'mode' in row['code_snippet'] and 'overwrite' in row['code_snippet']:
      # pass 1: replace overwrite with append
      code = row['code_snippet'].replace('overwrite', 'append')
      comment = row['code_description'].replace("overwritten", "appended")
      result_list.append([comment, code])

      # pass 2: replace overwrite with ignore
      code = row['code_snippet'].replace('overwrite', 'ignore')
      comment = row['code_description'].replace("The data is overwritten to file or directory MY_DIR", "The write is ignored if the file MY_DIR already exists")
      result_list.append([comment, code])

      # pass 3: replace overwrite with error
      code = row['code_snippet'].replace('overwrite', 'error')
      comment = row['code_description'].replace("The data is overwritten to file or directory MY_DIR", "Throws an error if the file MY_DIR already exists")
      result_list.append([comment, code])

    # Build list of dictionaries representing new rows
    for item in result_list:
      new_rows.append({
      'code_description': item[0],
      'code_snippet': item[1],
      'import_line': row['import_line'],
      'Category': row['Category'],
      'function': row['function'],
      'origin_str': row['origin_str']
      })

# Concatenate the new rows to the DataFrame using pd.concat()
df_augment_misc_fn = pd.concat([df_augment_misc_fn, pd.DataFrame(new_rows)], ignore_index=True)

# Drop Duplicates
fn_display_message(f' --> Display Count Before dropping duplicates - {df_augment_misc_fn.shape[0]}')
df_augment_misc_fn = df_augment_misc_fn.drop_duplicates()
fn_display_message(f' --> Display Count After dropping duplicates - {df_augment_misc_fn.shape[0]}')

fn_display_header('Display Augmented Dataframe information: df_augment_misc_fn')
df_augment_misc_fn.info()

fn_display_header('Display Augmented Dataframe rows: df_augment_misc_fn')
df_augment_misc_fn.head().style.set_properties(**{'text-align': 'left'})

 --> Display Count of iterations - 37
 --> Display Count Before dropping duplicates - 234
 --> Display Count After dropping duplicates - 117
--------------------------------------------------------------------------------
           Display Augmented Dataframe information: df_augment_misc_fn
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Index: 117 entries, 0 to 230
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   code_description  117 non-null    object
 1   code_snippet      117 non-null    object
 2   import_line       0 non-null      object
 3   Category          117 non-null    object
 4   function          117 non-null    object
 5   origin_str        117 non-null    object
dtypes: object(6)
memory usage: 6.4+ KB
--------------------------------------------------------------------------------
           Display Augmented Dataframe

Unnamed: 0,code_description,code_snippet,import_line,Category,function,origin_str
0,Pyspark code to overwrite a DataFrame into a CSV file.The data is appended to file or directory MY_DIR .,"df.write.mode(""append"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
1,Pyspark code to overwrite a DataFrame into a CSV file.The write is ignored if the file MY_DIR already exists .,"df.write.mode(""ignore"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
2,Pyspark code to overwrite a DataFrame into a CSV file.Throws an error if the file MY_DIR already exists .,"df.write.mode(""error"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
6,Pyspark code to overwrite a DataFrame into a JSON file.The data is appended to file or directory MY_DIR .,"df.write.mode(""append"").format(""json"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.format,original
7,Pyspark code to overwrite a DataFrame into a JSON file.The write is ignored if the file MY_DIR already exists .,"df.write.mode(""ignore"").format(""json"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.format,original
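
The write-mode pass in the cell above can be sketched in isolation. This is a minimal illustration rather than the notebook's own code: `expand_write_modes` and `MODE_PHRASES` are hypothetical names, and the phrases are copied from the replacements used in that cell.

```python
# Hypothetical helper: expand one "overwrite" row into its other save-mode variants.
OVERWRITE_PHRASE = "The data is overwritten to file or directory MY_DIR"

MODE_PHRASES = {
    "append": "The data is appended to file or directory MY_DIR",
    "ignore": "The write is ignored if the file MY_DIR already exists",
    "error": "Throws an error if the file MY_DIR already exists",
}

def expand_write_modes(comment, code):
    """Return (comment, code) pairs, one per alternative save mode."""
    if "mode" not in code or "overwrite" not in code:
        return []
    return [
        (comment.replace(OVERWRITE_PHRASE, phrase), code.replace("overwrite", mode))
        for mode, phrase in MODE_PHRASES.items()
    ]

variants = expand_write_modes(
    "Pyspark code to overwrite a DataFrame into a CSV file."
    "The data is overwritten to file or directory MY_DIR .",
    'df.write.mode("overwrite").format("csv").save(MY_DIR)',
)
for comment, code in variants:
    print(code, "|", comment)
```

Only the second sentence of the description is rewritten, so the leading sentence still says "overwrite", as it does in the output table above.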


<br>

# CONCATENATE AUGMENTED DATASETS

In [None]:
# CONCATENATE AUGMENTED DATASETS - PASS 1

df_concat_pass_1 = pd.concat([df_final_standardized, df_augment_display_fn, df_augment_sort_fn, df_augment_misc_fn], ignore_index=True)

# Drop Duplicates
fn_display_message(f' --> Display Count Before dropping duplicates - {df_concat_pass_1.shape[0]}')
df_concat_pass_1 = df_concat_pass_1.drop_duplicates()
fn_display_message(f' --> Display Count After dropping duplicates - {df_concat_pass_1.shape[0]}')

# Fix some common punctuation errors in the descriptions (double full stops, double spaces)
df_concat_pass_1['code_description']  = df_concat_pass_1['code_description'].apply(lambda x: x.replace('..', '.').replace('  ', ' ').replace('. .', '.'))

fn_display_header('Display Augmented Dataframe information: df_concat_pass_1')
df_concat_pass_1.info()

fn_display_header('Display Augmented Dataframe rows: df_concat_pass_1')
df_concat_pass_1.head().style.set_properties(**{'text-align': 'left'})




 --> Display Count Before dropping duplicates - 1395
 --> Display Count After dropping duplicates - 1395
--------------------------------------------------------------------------------
           Display Augmented Dataframe information: df_concat_pass_1
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1395 entries, 0 to 1394
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   code_description  1395 non-null   object
 1   code_snippet      1395 non-null   object
 2   import_line       153 non-null    object
 3   Category          1395 non-null   object
 4   function          1395 non-null   object
 5   origin_str        1395 non-null   object
dtypes: object(6)
memory usage: 65.5+ KB
--------------------------------------------------------------------------------
           Display Augmented Dataframe rows: df_concat_pass_1
------

Unnamed: 0,code_description,code_snippet,import_line,Category,function,origin_str
0,Pyspark code to overwrite a DataFrame into a CSV file.The data is overwritten to file or directory MY_DIR .,"df.write.mode(""overwrite"").format(""csv"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
1,Pyspark code to Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon'.The data is read from file or directory MY_DIR.,"df = spark.read.csv(MY_DIR, schema=df.schema, nullValue=""Hyukjin Kwon"")",,Input/Output,pyspark.sql.DataFrameReader.csv,original
2,Pyspark code to overwrite a DataFrame into a JSON file.The data is overwritten to file or directory MY_DIR .,"df.write.mode(""overwrite"").format(""json"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.format,original
3,Pyspark code which Specifies the input data source format.The data is overwritten to file or directory MY_DIR .,"df.write.mode(""overwrite"").format(""json"").save(MY_DIR)",,Input/Output,pyspark.sql.DataFrameReader.format,original
4,Pyspark code to Read the JSON file as a DataFrame.The data is read from file or directory MY_DIR.,df = spark.read.format('json').load(MY_DIR),,Input/Output,pyspark.sql.DataFrameReader.format,original
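
The chained `replace` calls in the cell above make a single pass, so a run such as `...` or three consecutive spaces is only partly collapsed. A regex-based variant is sketched below; `fn_clean_description` is a hypothetical name, and it assumes descriptions are single-line strings.

```python
import re

def fn_clean_description(text):
    """Collapse repeated spaces and repeated full stops in a description."""
    text = re.sub(r" {2,}", " ", text)         # runs of spaces -> single space
    text = re.sub(r"\.(\s*\.)+", ".", text)    # "..", ". ." and "..." -> "."
    return text

print(fn_clean_description("row null null is produced.. It includes   column COL_A. ."))
```

It could be applied with `df['code_description'].apply(fn_clean_description)` in place of the lambda, if the stricter cleanup is wanted.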


<br>
<br>
<br>

----
# <b>TEXT AUGMENTATION</b>
----

<br>

# Text Augmentation: SUMY

In [None]:
# CODE_DESCRIPTION AUGMENTATION TECHNIQUE - SUMY

df_aug_sumy = df_concat_pass_1.copy()

from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer

# Parse the code_description column as plain text
sumy_summaries = []
for index, row in tqdm(df_aug_sumy.iterrows()):

  # Split the description on '.' and put each sentence on its own line for the parser
  code_description = "\n".join(row["code_description"].split('.'))

  sumy_parser = PlaintextParser.from_string(code_description, Tokenizer("english"))

  # LexRankSummarizer: LexRank is an unsupervised approach inspired by algorithms like PageRank and HITS (Hypertext Induced Topic Selection).
  # It ranks sentences based on their cosine similarity and degree centrality in the sentence graph
  sumy_summarizer = LexRankSummarizer()
  sumy_summary = sumy_summarizer(sumy_parser.document, 2)

  sumy_summary = '. '.join([str(sentence) for sentence in sumy_summary])
  sumy_summaries.append(sumy_summary)

df_aug_sumy['sumy_summary'] = sumy_summaries

df_aug_sumy[
    df_aug_sumy['code_description'] != df_aug_sumy['sumy_summary']
][['code_description', 'sumy_summary']].head().style.set_properties(**{'text-align': 'left'})


1395it [00:01, 840.05it/s] 


Unnamed: 0,code_description,sumy_summary
0,Pyspark code to overwrite a DataFrame into a CSV file.The data is overwritten to file or directory MY_DIR .,Pyspark code to overwrite a DataFrame into a CSV file The data is overwritten to file or directory MY_DIR
1,Pyspark code to Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon'.The data is read from file or directory MY_DIR.,Pyspark code to Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon' The data is read from file or directory MY_DIR
2,Pyspark code to overwrite a DataFrame into a JSON file.The data is overwritten to file or directory MY_DIR .,Pyspark code to overwrite a DataFrame into a JSON file The data is overwritten to file or directory MY_DIR
3,Pyspark code which Specifies the input data source format.The data is overwritten to file or directory MY_DIR .,Pyspark code which Specifies the input data source format The data is overwritten to file or directory MY_DIR
4,Pyspark code to Read the JSON file as a DataFrame.The data is read from file or directory MY_DIR.,Pyspark code to Read the JSON file as a DataFrame The data is read from file or directory MY_DIR
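
LexRank is an extractive summarizer: it selects existing sentences rather than rewriting them. The descriptions shown above have two sentences each, so asking for a two-sentence summary returns the same sentences. The toy sketch below illustrates that effect without sumy. `toy_extractive_summary` and its word-overlap scoring are assumptions made for illustration and are not the LexRank algorithm.

```python
def toy_extractive_summary(text, sentence_count=2):
    """Pick the `sentence_count` sentences sharing the most words with the rest."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    word_sets = [set(s.lower().split()) for s in sentences]

    def score(i):
        # Total word overlap between sentence i and every other sentence
        return sum(len(word_sets[i] & word_sets[j]) for j in range(len(sentences)) if j != i)

    ranked = sorted(range(len(sentences)), key=score, reverse=True)[:sentence_count]
    # Emit the chosen sentences in their original order
    return ". ".join(sentences[i] for i in sorted(ranked))

description = ("Pyspark code to Read the JSON file as a DataFrame."
               "The data is read from file or directory MY_DIR.")
print(toy_extractive_summary(description))
```

Whatever the scoring, an extractive method with `sentence_count` equal to the number of input sentences can only return the input.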


<br>

# Text Augmentation: T5

In [None]:
from transformers import T5Tokenizer, T5ForConditionalGeneration

df_aug_t5_summarizer = df_concat_pass_1.copy()

def fn_summarize_text(text, model, tokenizer, max_length=512, num_beams=3):
    inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=max_length, truncation=True)
    summary_ids = model.generate(inputs, max_length=50, num_beams=num_beams)
    summarized_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    return summarized_text

# Load the model and tokenizer once, outside the loop
model = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")

t5_summaries = []
cnt = 5
for index, row in df_aug_t5_summarizer.head(cnt).iterrows():
  text = row['code_description']
  summary = fn_summarize_text(text, model, tokenizer)
  t5_summaries.append([text, summary])

fn_display_header(f"Display the T5 Summaries for the first {cnt} rows in dataset")
for idx, item in enumerate(t5_summaries):
  print(f"------ {idx} ------")
  print(f"text    : {item[0]}")
  print(f"summary : {item[1]}")  # item[1] holds the generated summary, item[0] the input text



You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

--------------------------------------------------------------------------------
           Display the T5 Summaries for the first 5 rows in dataset
--------------------------------------------------------------------------------
------ 0 ------
text    : Pyspark code to overwrite a DataFrame into a CSV file.The data is overwritten to file or directory MY_DIR .
summarry: Pyspark code to overwrite a DataFrame into a CSV file.The data is overwritten to file or directory MY_DIR .
------ 1 ------
text    : Pyspark code to Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon'.The data is read from file or directory MY_DIR.
summarry: Pyspark code to Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon'.The data is read from file or directory MY_DIR.
------ 2 ------
text    : Pyspark code to overwrite a DataFrame into a JSON file.The data is overwritten to file or directory MY_DIR .
summarry: Pyspark code to overwrite a DataFrame into a JSON f
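
One way to decide whether a generated summary differs enough from its source to count as an alternate description is a token-overlap score. The sketch below uses Jaccard similarity over lowercased words. The function names and the 0.8 threshold are illustrative assumptions, not values taken from this notebook.

```python
def jaccard_similarity(text_a, text_b):
    """Share of distinct lowercased words common to both texts (1.0 = same vocabulary)."""
    words_a, words_b = set(text_a.lower().split()), set(text_b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)

def is_useful_paraphrase(original, candidate, max_similarity=0.8):
    """Treat a candidate as useful only if it differs enough from the original."""
    return jaccard_similarity(original, candidate) < max_similarity

original = "Pyspark code to Read the JSON file as a DataFrame."
print(is_useful_paraphrase(original, original))
print(is_useful_paraphrase(original, "Load a JSON file into a DataFrame with Pyspark."))
```

A summary identical to its input scores 1.0 and is rejected, so the check could be run over `t5_summaries` to count how many rows actually changed.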

<br>

# Text Augmentation: Synonym Replacement

In [None]:
from nltk.corpus import wordnet as wn
from nltk.corpus import stopwords
import random

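def get_synonyms(word, pos=None, max_synonyms=5):
    # Collect lemma names from WordNet for the given word, visiting at most max_synonyms lemmas
    synonyms = set()
    count = 0
    for syn in wn.synsets(word, pos=pos):
        for lemma in syn.lemmas():
            synonym = lemma.name().replace("_", " ").replace("-", "").lower()
            # Keep only a-z; this also drops spaces, so multi-word lemmas are merged into one token
            synonym = "".join([char for char in synonym if char in "abcdefghijklmnopqrstuvwxyz"])
            synonyms.add(synonym)
            count += 1
            if count >= max_synonyms:
                break
        if count >= max_synonyms:
            break  # stop scanning further synsets once the limit is reached
    synonyms.discard(word)
    return list(synonyms)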

# Replace synonyms in your text
def synonym_replacement(words, n):
    words = words.split()
    new_words = words.copy()
    random_word_list = list(set([word for word in words if word not in stopwords.words('english')]))
    random.shuffle(random_word_list)
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(list(synonyms))
            new_words = [synonym if word == random_word else word for word in new_words]
            num_replaced += 1
            if num_replaced >= n:
                break
    sentence = ' '.join(new_words)
    return sentence

# Example 1
original_text = "Pyspark code which returns a reversed string or an array with reverse order of elements in column COL_A in dataframe MY_DF.The result is applied for all rows in MY_DF and stored in the column MY_ALIAS_A in the target dataframe."
augmented_text = synonym_replacement(original_text, n=3)  # Replace 3 words with synonyms
print("------ 1 ------")
print(f"text     : {original_text}")
print(f"augmented: {augmented_text}")

# Example 2
original_text = "Pyspark code which Returns a new row for each element with position in the array or map given by column COL_C in dataframe MY_DF in dataframe MY_DF.Unlike posexplode if the array map is null or empty then the row null null is produced.. It includes column COL_A and column COL_B in the explode process.The result is applied for all rows in MY_DF and stored in the column MY_ALIAS_A in the target dataframe."
augmented_text = synonym_replacement(original_text, n=3)  # Replace 3 words with synonyms
print("------ 2 ------")
print(f"text     : {original_text}")
print(f"augmented: {augmented_text}")


------ 1 ------
text    : Pyspark code which returns a reversed string or an array with reverse order of elements in column COL_A in dataframe MY_DF.The result is applied for all rows in MY_DF and stored in the column MY_ALIAS_A in the target dataframe.
summarry: Pyspark code which returns a reversed string or an array with reverse order of element in chromatographycolumn COL_A in dataframe MY_DF.The result is applied for all quarrel in MY_DF and stored in the chromatographycolumn MY_ALIAS_A in the target dataframe.
Pyspark code which Returns a new row for each element with position in the regalia or map precondition by column COL_C in dataframe MY_DF in dataframe MY_DF.Unlike posexplode if the regalia map is null or empty then the row null null is produced.. It includes column COL_A and column COL_B in the blowup process.The result is applied for all rows in MY_DF and stored in the column MY_ALIAS_A in the target dataframe.
------ 1 ------
text    : Pyspark code which Returns a new ro
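
A curated synonym table limited to a few general-purpose words would avoid WordNet senses such as "quarrel" for "rows" or "regalia" for "array" seen in the output above. The sketch below is illustrative only: `CURATED_SYNONYMS` and `curated_synonym_replacement` are hypothetical, and the word choices are assumptions rather than a vetted vocabulary.

```python
import random

# Hypothetical, hand-picked synonyms for non-technical words only.
CURATED_SYNONYMS = {
    "returns": ["gives", "produces"],
    "stored": ["saved", "kept"],
    "applied": ["performed", "executed"],
}

def curated_synonym_replacement(text, n, rng=None):
    """Replace up to n distinct words using the curated table; leave all other tokens untouched."""
    rng = rng or random.Random()
    words = text.split()
    candidates = sorted({w for w in words if w in CURATED_SYNONYMS})
    rng.shuffle(candidates)
    chosen = {w: rng.choice(CURATED_SYNONYMS[w]) for w in candidates[:n]}
    return " ".join(chosen.get(w, w) for w in words)

text = "Pyspark code which returns rows of column COL_A stored in the column MY_ALIAS_A"
print(curated_synonym_replacement(text, n=2, rng=random.Random(0)))
```

Because only words present in the table are eligible, standardized identifiers such as `COL_A` and technical terms such as "column" are never replaced.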

<br>
<br>
<br>

----
# <b>AUGMENTATION SECTION: EXPLODE ON STANDARDIZED IDENTIFIERS</b>
----


<br>

# Augmentation: Explode on MY_DIR

In [None]:
# Augmentation: Explode on MY_DIR

def fn_explode_my_dir(comment, code):

  random_write_dir_names = [
      'result_dir', 'output_folder', 'output_path', 'output_location', 'output_directory', 'output_dest', 'output_archive', 'output_store', 'output_dump',
      #'output_export', 'output_dir', 'out_directory', 'out_path', 'out_loc', 'out_location', 'out_destination', 'out_dir'
      ]
  random_read_dir_names  = [
      'data_in', 'input_data', 'input_folder', 'input_source', 'input_path', 'input_archive', 'input_store', 'input_dump', 'input_source', 'input_location',
      #'input_loc', 'archive_dir', 'archive_directory', 'inp_loc', 'inp_src', 'file_location', 'file_dir'
      ]

  # Choose the candidate names depending on whether the snippet reads or writes
  if '.read.' in code:
    dir_names = random_read_dir_names
  elif '.write.' in code:
    dir_names = random_write_dir_names
  else:
    dir_names = []

  result_array = []
  for item in dir_names:
    # Pass 1 with upper case replacements
    result_array.append(comment.replace('MY_DIR', item.upper()) + '--delim--' + code.replace('MY_DIR', item.upper()))

    # Pass 2 with lower case replacements
    result_array.append(comment.replace('MY_DIR', item) + '--delim--' + code.replace('MY_DIR', item))

  return result_array

fn_display_header("Explode on MY_DIR")
df_aug_explode_my_dir = df_concat_pass_1[(df_concat_pass_1['code_snippet'].str.contains('MY_DIR'))].copy()

fn_display_message(f' --> Display Rows with MY_DIR within code: {df_aug_explode_my_dir.shape[0]}')

# Apply explode function
df_aug_explode_my_dir['result_array']  = df_aug_explode_my_dir.apply(lambda x: fn_explode_my_dir(x['code_description'], x['code_snippet']), axis=1)

# Normalize the code on result_array
fn_display_message(f' --> Display Count Before Explode: {df_aug_explode_my_dir.shape[0]}')
df_aug_explode_my_dir = df_aug_explode_my_dir.explode('result_array').reset_index()
fn_display_message(f' --> Display Count After Explode: {df_aug_explode_my_dir.shape[0]}')

# Split and assign result_array to code and comment
df_aug_explode_my_dir['code_description'] = df_aug_explode_my_dir['result_array'].apply(lambda x: x.split('--delim--')[0])
df_aug_explode_my_dir['code_snippet'] = df_aug_explode_my_dir['result_array'].apply(lambda x: x.split('--delim--')[1])


--------------------------------------------------------------------------------
           Explode on MY_DIR
--------------------------------------------------------------------------------
 --> Display Rows with MY_DIR within code: 194
 --> Display Count Before Explode: 194
 --> Display Count After Explode: 3578
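
The explode step can also carry the comment and code as a tuple, which avoids joining on `--delim--` and splitting again later. A minimal pure-Python sketch follows, assuming a shortened name list; `explode_my_dir_pairs` is a hypothetical name.

```python
def explode_my_dir_pairs(comment, code, dir_names):
    """Return (comment, code) pairs with MY_DIR replaced by each name in upper and lower case."""
    pairs = []
    for name in dict.fromkeys(dir_names):        # dict.fromkeys drops repeated names, keeps order
        for variant in (name.upper(), name):
            pairs.append((comment.replace("MY_DIR", variant), code.replace("MY_DIR", variant)))
    return pairs

pairs = explode_my_dir_pairs(
    "The data is read from file or directory MY_DIR.",
    "df = spark.read.format('json').load(MY_DIR)",
    ["data_in", "input_source", "input_source"],
)
for comment, code in pairs:
    print(code)
```

A column of such tuple lists can be passed to `DataFrame.explode` in the same way as the delimiter-joined strings.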


In [None]:
# DROP DUPLICATES from Explode dataset for MY_DIR

df_aug_explode_my_dir = df_aug_explode_my_dir[['code_description', 'code_snippet', 'import_line', 'Category', 'function', 'origin_str']].copy()

# Drop Duplicates
fn_display_message(f' --> Display Count Before dropping duplicates - {df_aug_explode_my_dir.shape[0]}')
df_aug_explode_my_dir = df_aug_explode_my_dir.drop_duplicates()
fn_display_message(f' --> Display Count After dropping duplicates - {df_aug_explode_my_dir.shape[0]}')

fn_display_header('Display Augmented Dataframe information: df_aug_explode_my_dir')
df_aug_explode_my_dir.info()

df_aug_explode_my_dir.head().style.set_properties(**{'text-align': 'left'})

 --> Display Count Before dropping duplicates - 3578
 --> Display Count After dropping duplicates - 3492
--------------------------------------------------------------------------------
           Display Augmented Dataframe information: df_aug_explode_my_dir
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
Index: 3492 entries, 0 to 3577
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   code_description  3492 non-null   object
 1   code_snippet      3492 non-null   object
 2   import_line       0 non-null      object
 3   Category          3492 non-null   object
 4   function          3492 non-null   object
 5   origin_str        3492 non-null   object
dtypes: object(6)
memory usage: 191.0+ KB


Unnamed: 0,code_description,code_snippet,import_line,Category,function,origin_str
0,Pyspark code to overwrite a DataFrame into a CSV file.The data is overwritten to file or directory RESULT_DIR .,"df.write.mode(""overwrite"").format(""csv"").save(RESULT_DIR)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
1,Pyspark code to overwrite a DataFrame into a CSV file.The data is overwritten to file or directory result_dir .,"df.write.mode(""overwrite"").format(""csv"").save(result_dir)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
2,Pyspark code to overwrite a DataFrame into a CSV file.The data is overwritten to file or directory OUTPUT_FOLDER .,"df.write.mode(""overwrite"").format(""csv"").save(OUTPUT_FOLDER)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
3,Pyspark code to overwrite a DataFrame into a CSV file.The data is overwritten to file or directory output_folder .,"df.write.mode(""overwrite"").format(""csv"").save(output_folder)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
4,Pyspark code to overwrite a DataFrame into a CSV file.The data is overwritten to file or directory OUTPUT_PATH .,"df.write.mode(""overwrite"").format(""csv"").save(OUTPUT_PATH)",,Input/Output,pyspark.sql.DataFrameReader.csv,original
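
The counts printed above are consistent with the name lists in `fn_explode_my_dir`. The check below assumes that each of the 194 rows matched exactly one of the `.read.` or `.write.` branches. That is an assumption, since the split is not printed. The write list has 9 names and the read list has 10 entries, with 'input_source' listed twice. Each entry yields an upper-case and a lower-case variant.

```python
rows_in = 194
rows_exploded = 3578
rows_deduped = 3492

write_variants = 9 * 2    # 9 write names, upper + lower case
read_variants = 10 * 2    # 10 read entries (one repeated), upper + lower case

# Solve: read_rows + write_rows = rows_in
#        read_variants * read_rows + write_variants * write_rows = rows_exploded
read_rows = (rows_exploded - write_variants * rows_in) // (read_variants - write_variants)
write_rows = rows_in - read_rows

# The repeated name contributes 2 duplicate rows per read row
duplicates = 2 * read_rows

print(read_rows, write_rows, rows_exploded - duplicates)
```

Under that assumption, the rows removed by `drop_duplicates()` are fully explained by the repeated 'input_source' entry.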


<br>

# Augmentation: Explode on STANDARDIZED COLUMNS and LITERALS

In [None]:
# Augmentation: Explode on STANDARDIZED COLUMNS COL_A, COL_B etc.

# Function to replace columns with values
def fn_col_explode_replacement(code, comment, search_str, replace_str, delim, upcase):
  result = ''
  delim = '--delim--'
  if upcase == 'Y':
    # Pass 1 with upper case replacements
    result = comment.replace(search_str, replace_str.upper()) + delim + code.replace(search_str, replace_str.upper())
  else:
    # Pass 2 with lower case replacements
    result = comment.replace(search_str, replace_str) + delim + code.replace(search_str, replace_str)

  return result

# Function to replace literals with values
def fn_lit_explode_replacement(code, comment, search_str, replace_str, delim, upcase):
  result = ''
  delim = '--delim--'
  result = comment.replace(search_str, replace_str.upper()) + delim + code.replace(search_str, replace_str.upper())

  return result

# Function to explode the code and comments
def fn_explode_cols_lits(comment, code, std_name):

  delim = '--delim--'

  col_a_list = ['employee_id', 'customer_age', 'product_quantity', 'order_total', #'transaction_date',# 'product_category'
      #, 'sales_representative', 'customer_zipcode', 'inventory_stock', 'supplier_name', 'region_code', 'payment_method', 'customer_segment', 'product_rating', 'shipment_status', 'customer_feedback', 'product_quality', 'order_source', 'supplier_payment_terms', 'employee_certifications'
      ]

  col_b_list = ['customer_gender', 'order_priority', 'product_brand', 'department_code', #'account_balance',# 'employee_role'
      #, 'customer_city', 'product_color', 'contract_duration', 'project_manager', 'store_location', 'customer_income', 'product_weight', 'campaign_channel', 'service_level', 'customer_purchase_history', 'product_sustainability', 'payment_security_code', 'sales_lead_time', 'customer_social_media'
      ]

  col_c_list = ['product_type', 'order_date', 'supplier_country', 'employee_department', #'customer_lifetime_value',# 'product_discount'
      #, 'payment_due_date', 'sales_region', 'customer_email', 'product_material', 'order_ship_date', 'supplier_contact', 'employee_manager', 'customer_phone', 'product_size', 'product_compliance', 'order_return_reason', 'supplier_lead_time', 'employee_workload', 'customer_referral_source'
      ]

  col_d_list = [
      'payment_status', 'sales_target', 'customer_address', 'product_expiry_date', 'order_quantity', 'supplier_rating'
      #, 'employee_tenure', 'customer_subscription', 'product_origin', 'payment_amount',
      #'sales_channel', 'customer_marital_status', 'product_warranty', 'order_status', 'supplier_location'
      ]

  col_e_list = ['employee_performance', 'customer_membership', 'product_features', 'payment_reference', 'sales_growth_rate', 'customer_loyalty'
      #, 'product_style', 'order_discount', 'supplier_type', 'employee_shift', 'customer_interests', 'product_season', 'payment_frequency', 'sales_conversion_rate'
      ]

  col_alias_list = ['target_column', 'output_column', 'result_col'] #, 'results'  ]

  lit_str_0_list = ["hello", "pyspark", "python"]
  lit_str_1_list = ["course", "sales", "order", "NA"]

  lit_int_0_list = ['1', '7', '-12', '33']
  lit_int_1_list = ['44', '-172', '165', '-775']
  lit_int_2_list = ['63', '-66', '1899', '1714']

  lit_dec_0_list = ['0.723', '2.71', '-4.56']
  lit_dec_1_list = ['7.21', '8.23', '-3.5501', '-0.23']

  result_array = []

  # Map each standardized identifier to its list of replacement values
  replacement_lists = {
      'COL_A': col_a_list, 'COL_B': col_b_list, 'COL_C': col_c_list, 'COL_D': col_d_list, 'COL_E': col_e_list,
      'MY_ALIAS_A': col_alias_list,
      'LIT_STR_0': lit_str_0_list, 'LIT_STR_1': lit_str_1_list,
      'LIT_INT_0': lit_int_0_list, 'LIT_INT_1': lit_int_1_list, 'LIT_INT_2': lit_int_2_list,
      'LIT_DEC_0': lit_dec_0_list, 'LIT_DEC_1': lit_dec_1_list,
      }
  final_column_list = replacement_lists.get(std_name, [])


  for idx, replace_std_name in enumerate(final_column_list):
    upcase = 'N'
    if idx % 2 == 0:
       upcase = 'Y'
    if std_name in code and std_name in comment:
      result_array.append(fn_col_explode_replacement(code, comment, std_name, replace_std_name, delim, upcase))
    else:
      result_array.append(comment + delim + code)

  result = result_array
  return result

fn_display_header("Explode on STANDARDIZED COLUMNS and LITERALS")
df_aug_explode_cols_lits = df_concat_pass_1[
  (df_concat_pass_1['code_snippet'].str.contains('COL_')) | (df_concat_pass_1['code_snippet'].str.contains('MY_ALIAS_')) | (df_concat_pass_1['code_snippet'].str.contains('LIT_'))
  ].copy()

fn_display_message(f' --> Display Rows with COL_ within code: {df_aug_explode_cols_lits.shape[0]}')

# Apply explode function in a loop for all standardized columns
df_aug_explode_cols_lits['result_array']  = df_aug_explode_cols_lits.apply(lambda x: [], axis=1)
for std_name in ['COL_A', 'COL_B', 'COL_C', 'COL_D', 'COL_E', 'MY_ALIAS_A', 'LIT_STR_0', 'LIT_STR_1', 'LIT_INT_0', 'LIT_INT_1', 'LIT_INT_2', 'LIT_DEC_0', 'LIT_DEC_1']:
  # Apply explode function
  df_aug_explode_cols_lits['result_array']  = df_aug_explode_cols_lits.apply(lambda x: fn_explode_cols_lits(x['code_description'], x['code_snippet'], std_name), axis=1)

  # Normalize the code on result_array
  fn_display_message(f' ----------- Process for {std_name} -----------')
  fn_display_message(f' --> Display Count Before Explode for {std_name}: {df_aug_explode_cols_lits.shape[0]}')
  df_aug_explode_cols_lits = df_aug_explode_cols_lits.explode('result_array')
  fn_display_message(f' --> Display Count After Explode  for {std_name}: {df_aug_explode_cols_lits.shape[0]}')

  # Drop Duplicates
  fn_display_message(f' --> Display Count Before dropping duplicates for {std_name}: {df_aug_explode_cols_lits.shape[0]}')
  df_aug_explode_cols_lits = df_aug_explode_cols_lits.drop_duplicates()
  fn_display_message(f' --> Display Count After dropping duplicates  for {std_name}: {df_aug_explode_cols_lits.shape[0]}')

  # Split and assign result_array to code and comment
  df_aug_explode_cols_lits['code_description'] = df_aug_explode_cols_lits['result_array'].apply(lambda x: x.split('--delim--')[0])
  df_aug_explode_cols_lits['code_snippet']     = df_aug_explode_cols_lits['result_array'].apply(lambda x: x.split('--delim--')[1])

fn_display_header('Display Augmented Dataframe information: df_aug_explode_cols_lits')
df_aug_explode_cols_lits = df_aug_explode_cols_lits.reset_index()
df_aug_explode_cols_lits.info()

--------------------------------------------------------------------------------
           Explode on MY_DIR
--------------------------------------------------------------------------------
 --> Display Rows with COL_ within code: 533
 ----------- Process for COL_A -----------
 --> Display Count Before Explode for COL_A: 533
 --> Display Count After Explode  for COL_A: 2132
 --> Display Count Before dropping duplicates for COL_A: 2132
 --> Display Count After dropping duplicates  for COL_A: 1934
 ----------- Process for COL_B -----------
 --> Display Count Before Explode for COL_B: 1934
 --> Display Count After Explode  for COL_B: 7736
 --> Display Count Before dropping duplicates for COL_B: 7736
 --> Display Count After dropping duplicates  for COL_B: 3587
 ----------- Process for COL_C -----------
 --> Display Count Before Explode for COL_C: 3587
 --> Display Count After Explode  for COL_C: 14348
 --> Display Count Before dropping duplicates for COL_C: 14348
 --> Display Count After
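
The explode, de-duplicate and split steps of the loop above can be reproduced on a toy DataFrame. This is a minimal sketch, not the notebook's implementation: `toy_explode` and its replacement values are hypothetical stand-ins for `fn_explode_cols_lits`, which is defined elsewhere in the notebook. It also shows why the row count grows less than the explode factor: rows that do not contain the standardized name produce identical variants, which `drop_duplicates()` collapses.

```python
import pandas as pd

DELIM = "--delim--"

def toy_explode(description, snippet, std_name, replacements=("price", "qty")):
    """Hypothetical stand-in for fn_explode_cols_lits.

    Returns one 'description--delim--snippet' string per replacement value.
    """
    return [
        f"{description.replace(std_name, value)}{DELIM}{snippet.replace(std_name, value)}"
        for value in replacements
    ]

df = pd.DataFrame({
    "code_description": ["Select COL_A from the DataFrame.", "Count the rows."],
    "code_snippet": ["df.select('COL_A')", "df.count()"],
})

# Build one list of variants per row
df["result_array"] = [
    toy_explode(d, s, "COL_A")
    for d, s in zip(df["code_description"], df["code_snippet"])
]

# One row per variant, then drop the identical variants of rows without COL_A
df = df.explode("result_array").drop_duplicates().reset_index(drop=True)

# Split each variant once and assign both halves
parts = df["result_array"].str.split(DELIM, n=1, expand=True)
df["code_description"] = parts[0]
df["code_snippet"] = parts[1]

print(df[["code_description", "code_snippet"]])
```

Splitting once with `expand=True` avoids calling `split` twice per row, and `reset_index(drop=True)` discards the repeated post-explode index instead of storing it as a column.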

<br>

# CONCATENATE EXPLODED DATASETS

In [None]:
# CONCATENATE AUGMENTED DATASETS - PASS 2

df_concat_pass_2 = pd.concat([df_concat_pass_1, df_aug_explode_my_dir, df_aug_explode_cols_lits], ignore_index=True)

# Report input counts, then drop duplicates
fn_display_message(f' --> Count df_concat_pass_1 : {df_concat_pass_1.shape[0]}')
fn_display_message(f' --> Count df_aug_explode_my_dir : {df_aug_explode_my_dir.shape[0]}')
fn_display_message(f' --> Count df_aug_explode_cols_lits : {df_aug_explode_cols_lits.shape[0]}')

fn_display_message(f' --> Display Count Before dropping duplicates - {df_concat_pass_2.shape[0]}')
df_concat_pass_2 = df_concat_pass_2.drop_duplicates()
fn_display_message(f' --> Display Count After dropping duplicates - {df_concat_pass_2.shape[0]}')

# Fix some common punctuation and spacing errors in the descriptions
df_concat_pass_2['code_description']  = df_concat_pass_2['code_description'].apply(lambda x: x.replace('..', '.').replace('  ', ' ').replace('. .', '.'))

fn_display_header('Display Augmented Dataframe information: df_concat_pass_2')
df_concat_pass_2.info()

fn_display_header('Display Augmented Dataframe rows: df_concat_pass_2')
df_concat_pass_2.head().style.set_properties(**{'text-align': 'left'})

 --> Count df_concat_pass_1 : 1395
 --> Count df_aug_explode_my_dir : 3492
 --> Count df_aug_explode_cols_lits : 15062
 --> Display Count Before dropping duplicates - 19949
 --> Display Count After dropping duplicates - 19949
--------------------------------------------------------------------------------
           Display Augmented Dataframe information: df_concat_pass_2
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19949 entries, 0 to 19948
Data columns (total 8 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   code_description  19949 non-null  object 
 1   code_snippet      19949 non-null  object 
 2   import_line       1344 non-null   object 
 3   Category          19949 non-null  object 
 4   function          19949 non-null  object 
 5   origin_str        19949 non-null  object 
 6   index             15062 non-null  float64
 7   result_

| | code_description | code_snippet | import_line | Category | function | origin_str | index | result_array |
|---|---|---|---|---|---|---|---|---|
| 0 | Pyspark code to overwrite a DataFrame into a CSV file.The data is overwritten to file or directory MY_DIR . | `df.write.mode("overwrite").format("csv").save(MY_DIR)` | | Input/Output | pyspark.sql.DataFrameReader.csv | original | | |
| 1 | Pyspark code to Read the CSV file as a DataFrame with 'nullValue' option set to 'Hyukjin Kwon'.The data is read from file or directory MY_DIR. | `df = spark.read.csv(MY_DIR, schema=df.schema, nullValue="Hyukjin Kwon")` | | Input/Output | pyspark.sql.DataFrameReader.csv | original | | |
| 2 | Pyspark code to overwrite a DataFrame into a JSON file.The data is overwritten to file or directory MY_DIR . | `df.write.mode("overwrite").format("json").save(MY_DIR)` | | Input/Output | pyspark.sql.DataFrameReader.format | original | | |
| 3 | Pyspark code which Specifies the input data source format.The data is overwritten to file or directory MY_DIR . | `df.write.mode("overwrite").format("json").save(MY_DIR)` | | Input/Output | pyspark.sql.DataFrameReader.format | original | | |
| 4 | Pyspark code to Read the JSON file as a DataFrame.The data is read from file or directory MY_DIR. | `df = spark.read.format('json').load(MY_DIR)` | | Input/Output | pyspark.sql.DataFrameReader.format | original | | |
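
The `index` column in the output above has 15062 non-null values, the same as the row count of `df_aug_explode_cols_lits`, and a `float64` dtype. This pattern is what pandas produces when one input frame was passed through `reset_index()` without `drop=True` and the other frames lack that column. The sketch below reproduces the effect on toy frames; the frame names and contents are illustrative assumptions, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins: 'base' plays the role of a frame without an 'index' column,
# 'exploded' has a repeated index, as a frame does after DataFrame.explode()
base = pd.DataFrame({"code_snippet": ["df.count()", "df.show()"]})
exploded = pd.DataFrame(
    {"code_snippet": ["df.select('a')", "df.select('b')"]}, index=[0, 0]
)

kept = exploded.reset_index()              # old index becomes a column named 'index'
dropped = exploded.reset_index(drop=True)  # old index is discarded

combined_kept = pd.concat([base, kept], ignore_index=True)
combined_dropped = pd.concat([base, dropped], ignore_index=True)

combined_kept.info()     # shows an extra 'index' column, NaN for rows from 'base'
combined_dropped.info()  # only 'code_snippet'
```

Because the final dataset selects its five columns explicitly, the extra column does not reach the exported file. Using `drop=True` would only keep the intermediate frames tidier.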


In [None]:
# UNCOMMENT TO DEBUG
#df_concat_pass_1.head(1000).style.set_properties(**{'text-align': 'left'})
#df_aug_explode_my_dir.head(1000).style.set_properties(**{'text-align': 'left'})
#df_aug_explode_cols_lits.head(1000).style.set_properties(**{'text-align': 'left'})
#df_concat_pass_2.head(1000).style.set_properties(**{'text-align': 'left'})
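
The description cleanup in the PASS 2 cell uses a single pass of chained `str.replace` calls, so patterns such as `MY_DIR .` (space before the period) or three consecutive spaces can survive, as the sample rows above suggest. A regex-based variant is sketched below as one possible alternative. `tidy_description` is a hypothetical helper, not part of the notebook, and its last rule is an assumption that may not suit descriptions containing dotted names such as `pyspark.sql.DataFrameReader`.

```python
import re

def tidy_description(text):
    """Hypothetical cleanup helper for code descriptions."""
    text = re.sub(r"\s+", " ", text)     # collapse runs of whitespace
    text = re.sub(r"\s+\.", ".", text)   # remove whitespace before a period
    text = re.sub(r"\.{2,}", ".", text)  # collapse repeated periods
    # Assumption: a period directly followed by a capital letter is a missing
    # sentence break. This would also split dotted identifiers, so review it
    # before applying it to real descriptions.
    text = re.sub(r"\.(?=[A-Z])", ". ", text)
    return text.strip()

sample = (
    "Pyspark code to overwrite a DataFrame into a CSV file."
    "The data is overwritten to file or directory MY_DIR ."
)
print(tidy_description(sample))
```

On a DataFrame this could be applied with `df['code_description'].apply(tidy_description)`, in the same way as the existing lambda.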

<br>

# CREATE FINAL DATASET AND DOWNLOAD

In [None]:
# CREATE FINAL AUGMENTED DATASET

df_final_augmented = df_concat_pass_2[['code_description', 'code_snippet', 'import_line', 'Category', 'function']].copy()

fn_display_header('Display df_final_augmented COLUMN/COUNT Details')
df_final_augmented.info()

--------------------------------------------------------------------------------
           Display df_final_augmented COLUMN/COUNT Details
--------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19949 entries, 0 to 19948
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   code_description  19949 non-null  object
 1   code_snippet      19949 non-null  object
 2   import_line       1344 non-null   object
 3   Category          19949 non-null  object
 4   function          19949 non-null  object
dtypes: object(5)
memory usage: 779.4+ KB


In [None]:
# WRITE FINAL DATASET TO CSV AND DOWNLOAD

DOWNLOAD_FLAG = 'Y'
if DOWNLOAD_FLAG == 'Y':
  df_final_augmented.to_csv('ETL_P4_data_augmentation_v1_20K.csv')

  from google.colab import files
  files.download('ETL_P4_data_augmentation_v1_20K.csv')
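
Note that `to_csv` writes the DataFrame index as an unnamed first column by default, so a later `pd.read_csv` of the exported file will show an `Unnamed: 0` column unless it is read with `index_col=0`. The sketch below illustrates the difference with an in-memory buffer; the toy frame is an assumption for illustration, and whether the index should be kept depends on what the downstream notebooks expect.

```python
import io
import pandas as pd

df = pd.DataFrame({
    "code_description": ["Count the rows."],
    "code_snippet": ["df.count()"],
})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```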
