# 1. Introduction

In this task, the taught Natural Language Processing (NLP) techniques are applied. The supplied dataset [`products.csv`](https://github.com/schmidt-marvin/ESI_2022_TecAA/tree/main/task03/provided_files/products.csv) contains a collection of customer reviews of drinkable and edible products. Each of these reviews mainly contains a `summary`, a `text`, and a `score` out of five stars. <br>

Given the `summary` and the `text` only, the goal is to train a classifier that is able to predict the `score`.<br>

This first colab (the "Preprocessing" colab) performs all steps described in the [task description document](https://github.com/schmidt-marvin/ESI_2022_TecAA/tree/main/task03/provided_files/ML2022_Milestone_3_Task_Definition.pdf). The steps are carried out in the order listed below. This order is important, since some steps rely on the output of earlier ones:
1. Removal of HTML structures
2. Removal of useless characters *(remove ! " $ % & / ( ) = _ ˆ ¡ @)*
3. Removal of Emoji *(remove 😠)*
4. Spelling correction *(gret -> great)*
5. Conversion of text to lower case *(I think -> i think)*
6. Removal of repeated words *(great great show -> great show)*
7. Expansion of word contractions *(don't -> do not)*
8. Lemmatization of all terms *(better -> good; great -> good)*
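To make the role of the ordering concrete, the sketch below chains two simplified stand-ins for steps 5 and 6. The helper names `to_lower`, `drop_repeated_words` and `run_pipeline` are illustrative assumptions, not the functions defined later in this colab. Because the repeated-word pattern is case-sensitive, it only catches `Great great` once the text has been lowercased.

```python
import re

def to_lower(line):
    # stand-in for step 5
    return line.lower()

def drop_repeated_words(line):
    # stand-in for step 6: remove a word that directly repeats the previous one
    return re.sub(r'\b(\w+)\s+\1\b', r'\1', line)

def run_pipeline(line, steps):
    # apply the steps one after another, in the given order
    for step in steps:
        line = step(line)
    return line

sample = "Great great show"
in_order = run_pipeline(sample, [to_lower, drop_repeated_words])
swapped = run_pipeline(sample, [drop_repeated_words, to_lower])
print(in_order)  # great show
print(swapped)   # great great show
```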


# 2. Preparations

## 2.1 Importing Libraries

In [None]:
# misc
from google.colab import output 
from google.colab import files
from io import StringIO
import pandas as pd

!pip install tqdm
import tqdm # progress bar

# removal of emoji characters
!pip install emoji
import emoji # contains function to check whether char is emoji

# spelling correction
!pip install textblob
from textblob import TextBlob

# regex operations 
import re

# lemmatization of terms
from nltk.stem import WordNetLemmatizer
import nltk
nltk.download("popular")
nltk.download('wordnet')
nltk.download('omw-1.4')

output.clear()

## 2.2 Importing dataset

The following statements download the provided raw dataset and load it into a pandas dataframe.<br>

Due to multiple formatting issues, the file has to be repaired before pandas can parse it, as further explained in the code comments.

In [None]:
# import products.csv
!wget https://raw.githubusercontent.com/schmidt-marvin/ESI_2022_TecAA/main/task03/provided_files/products.csv

output.clear()

# read 'products.csv' as raw string from colab file system
with open('products.csv') as f:
    products_raw = f.read()

'''
File formatting does not allow for processing with pd out of the box.
 -> "Error tokenizing data" error. 

So, we need to preprocess the CSV.
'''

print("Unprocessed:\t" + str(products_raw.split("\n")[3]))

# modify 'products.csv' for compatibility
products_raw = re.sub(r'"', '', products_raw) # (1) remove all quotation marks
print("After fix (1):\t" + str(products_raw.split("\n")[3]))

products_raw = re.sub(r';', '', products_raw) # (2) remove all semicolons
print("After fix (2):\t" + str(products_raw.split("\n")[3]))

# fix bad separator value (3)
'''
A comma is the separator value.
However, there are also commas in the summary and text fields.
Therefore, we'll receive an error as soon as any summary or text contains a comma.

"Text" field of the third entry:
This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' The Lion, The Witch, and The Wardrobe - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.
                                                                        ^^^

Goal: Remove all commas after the ninth one.

There surely exists a fancy regex for this, but I couldn't find it.
So, we'll use a caveman approach.
(Don't judge me, if it looks stupid but it works it's not stupid.)
'''

# idea: join list of corrected strings for better performance compared to + operator
# first string will be the unmodified header
buf = [products_raw.split("\n")[0]]

for line in products_raw.split("\n")[1:]:
  # line needs to be modified, if there are more than nine commas
  while line.count(',') > 9:
    # remove the last comma (will repeat until correct number is reached)
    line = ''.join(line.rsplit(",", 1))
  buf.append("\n" + line)

products_raw_fixed = ''.join(buf)

# show fixed data
print("After fix (3):\t" + str(products_raw_fixed.split("\n")[3]))


# write modified 'products.csv' to colab file system
with open('products_modified.csv', 'w') as f:
  f.write(products_raw_fixed)


df_products_raw = pd.read_csv("products_modified.csv", sep=",", index_col="Id")

df_products_raw.head()

Unprocessed:	"3,B000LQOCH0,ABXLMWJIXXAIN,""Natalia Corres """"Natalia Corres"""""",1,1,4,1219017600,""""""Delight"""" says it all"",""This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' """"The Lion, The Witch, and The Wardrobe"""" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.""";;;;;;;;;;;;;;;;;;;;;;;;
After fix (1):	3,B000LQOCH0,ABXLMWJIXXAIN,Natalia Corres Natalia Corres,1,1,4,1219017600,Delight says it all,This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  A

Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
3,B000LQOCH0,ABXLMWJIXXAIN,Natalia Corres Natalia Corres,1,1,4,1219017600,Delight says it all,This is a confection that has been around a fe...
4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
5,B006K2ZZ7K,A1UQRSCLF8GW1T,Michael D. Bigham M. Wassir,0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
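As an alternative to deleting the surplus commas in fix (3), a line could be split into exactly ten fields and written back with proper quoting, so that commas inside the review text survive. This is only a sketch: `split_review_line` and `to_quoted_csv` are hypothetical helper names, the sample line is shortened, and the approach assumes, like the loop above, that only the last field (`Text`) contains extra commas.

```python
import csv
import io

def split_review_line(line, n_fields=10):
    # split at the first n_fields - 1 commas only; everything after them stays in the last field
    return line.split(",", n_fields - 1)

def to_quoted_csv(rows):
    # csv.writer quotes any field that contains the delimiter
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()

line = ("3,B000LQOCH0,ABXLMWJIXXAIN,Natalia Corres,1,1,4,1219017600,"
        "Delight says it all,It is a light, pillowy citrus gelatin")
fields = split_review_line(line)
print(len(fields))   # 10
print(fields[-1])    # It is a light, pillowy citrus gelatin
print(to_quoted_csv([fields]))
```

Since the last field is quoted in the output, a standard CSV reader (including `pd.read_csv`) should be able to parse it without the tokenizing error.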


We've now successfully loaded the dataframe. <br>

Since we only need the fields `Score`, `Summary` and `Text`, we reduce the dataframe to these columns.

In [None]:
relevant_features = ['Summary', 'Text', 'Score']

df_products = df_products_raw[relevant_features]
df_products[:10]

Id,Summary,Text,Score
1,Good Quality Dog Food,I have bought several of the Vitality canned d...,5
2,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,1
3,Delight says it all,This is a confection that has been around a fe...,4
4,Cough Medicine,If you are looking for the secret ingredient i...,2
5,Great taffy,Great taffy at a great price. There was a wid...,5
6,Nice Taffy,I got a wild hair for taffy and ordered this f...,4
7,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...,5
8,Wonderful,tasty taffyThis taffy is so good. It is very...,5
9,Yay Barley,Right now I'm mostly just sprouting this so my...,5
10,Healthy Dog Food,This is a very healthy dog food. Good for thei...,5


## 2.3 Utility functions for dataframe

Since all of our data is now stored in a dataframe, we need a way to apply a processing function to **all text fields** of the dataframe. The following helpers read and replace the `Summary` and `Text` columns as plain Python lists.

In [None]:
def get_summary_values():
  return df_products['Summary'].tolist()

def replace_summary_values(new_summary_values):
  df_products.loc[:, 'Summary'] = new_summary_values

def get_text_values():
  return df_products['Text'].tolist()

def replace_text_values(new_text_values):
  df_products.loc[:, 'Text'] = new_text_values

# 3. Preprocessing

## 3.1 Removal of HTML tags

We noticed only after a first complete preprocessing run that the input dataset contains a few HTML tags, such as line breaks and embedded links. These are removed first, since an intact HTML tag structure is easier to detect with regular expressions.

In [None]:
html_tag = re.compile('<.*?>') 

def cleanhtml(line):
  output_line = re.sub(html_tag, ' ', str(line))
  return output_line

# print(cleanhtml("I ordered this for my wife as it was reccomended by our daughter.  She has this almost every morning and likes all flavors.  She's happy, I'm happy!!!<br /><a href=""http://www.amazon.com/gp/product/B001EO5QW8"">McCANN'S Instant Irish Oatmeal, Variety Pack of Regular, Apples & Cinnamon, and Maple & Brown Sugar, 10-Count Boxes (Pack of 6)</a>"))
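A short, self-contained check of the pattern (the sample string is made up for illustration): the non-greedy `.*?` stops at the first `>`, so each tag is replaced on its own, whereas a greedy `.*` would also swallow the text between the first and the last tag. Note that each tag becomes a space, so adjacent tags leave double spaces behind.

```python
import re

non_greedy = re.compile(r'<.*?>')  # same pattern as html_tag above
greedy = re.compile(r'<.*>')       # for comparison only

sample = "Product received is as advertised.<br /><br />Great <b>taste</b> overall"

cleaned = non_greedy.sub(' ', sample)
over_cleaned = greedy.sub(' ', sample)
print(cleaned)       # every tag replaced individually
print(over_cleaned)  # the words between the tags are lost as well
```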

In [None]:
# before
df_products.loc[25:27,:]

Id,Summary,Text,Score
25,Please sell these in Mexico!!,I have lived out of the US for over 7 yrs now ...,5
26,Twizzlers - Strawberry,Product received is as advertised.<br /><br />...,5
27,Nasty No flavor,The candy is just red No flavor . Just plan ...,1


In [None]:
# process summary values
summary_values = get_summary_values()
summary_values_processed = [cleanhtml(entry) for entry in summary_values]
replace_summary_values(summary_values_processed)

# process text values
text_values = get_text_values()
text_values_processed = [cleanhtml(entry) for entry in text_values]
replace_text_values(text_values_processed)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self.obj[selected_item_labels] = value


In [None]:
# after
df_products.loc[25:27,:]

Id,Summary,Text,Score
25,Please sell these in Mexico!!,I have lived out of the US for over 7 yrs now ...,5
26,Twizzlers - Strawberry,Product received is as advertised. Twizzlers...,5
27,Nasty No flavor,The candy is just red No flavor . Just plan ...,1


## 3.2 Removal of useless characters

This step is quite straightforward. We'll first define a utility function to perform the task. Then, we'll execute it for every value within the `summary` and `text` fields.

In [None]:
useless_characters = "!\"$%&/()=_ˆ¡@"

def remove_useless_characters(line):
  return ''.join(char for char in str(line) if char not in useless_characters)
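The same filtering can also be expressed with a translation table, which moves the per-character test out of the Python-level loop. The sketch below is an alternative under assumptions (`remove_useless_characters_translate` is a hypothetical name); the original function is repeated only so that the comparison is self-contained.

```python
useless_characters = "!\"$%&/()=_ˆ¡@"

# map every useless character to None, i.e. delete it
deletion_table = str.maketrans("", "", useless_characters)

def remove_useless_characters_translate(line):
    return str(line).translate(deletion_table)

def remove_useless_characters(line):
    # reference implementation from above
    return ''.join(char for char in str(line) if char not in useless_characters)

sample = "Great! Just as good as the expensive brands!"
print(remove_useless_characters_translate(sample))
```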

In [None]:
# process summary values
summary_values = get_summary_values()
summary_values_processed = [remove_useless_characters(entry) for entry in summary_values]
replace_summary_values(summary_values_processed)

# process text values
text_values = get_text_values()
text_values_processed = [remove_useless_characters(entry) for entry in text_values]
replace_text_values(text_values_processed)



In [None]:
# view results
df_products[:10]

Id,Summary,Text,Score
1,Good Quality Dog Food,I have bought several of the Vitality canned d...,5
2,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,1
3,Delight says it all,This is a confection that has been around a fe...,4
4,Cough Medicine,If you are looking for the secret ingredient i...,2
5,Great taffy,Great taffy at a great price. There was a wid...,5
6,Nice Taffy,I got a wild hair for taffy and ordered this f...,4
7,Great Just as good as the expensive brands,This saltwater taffy had great flavors and was...,5
8,Wonderful,tasty taffyThis taffy is so good. It is very...,5
9,Yay Barley,Right now I'm mostly just sprouting this so my...,5
10,Healthy Dog Food,This is a very healthy dog food. Good for thei...,5


## 3.3 Removal of emoji

This step follows the same principle as the one shown above.

In [None]:
def remove_emoji(line):
  return ''.join(char for char in str(line) if not emoji.is_emoji(char))
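For environments where the `emoji` package is not available, a rough approximation is possible with the standard library. This is a swapped-in technique, not the notebook's method: the hypothetical helper `remove_symbol_characters` drops every code point in the Unicode category `So` ("Symbol, other"). That category contains most pictographic emoji, but also non-emoji symbols such as `©`, and it does not cover auxiliary code points such as skin-tone modifiers or joiners, so the result can differ from `emoji.is_emoji`.

```python
import unicodedata

def remove_symbol_characters(line):
    # drop every code point whose Unicode category is "So" (Symbol, other)
    return ''.join(char for char in str(line) if unicodedata.category(char) != 'So')

print(remove_symbol_characters("so good 😠"))
```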

In [None]:
# process summary values
summary_values = get_summary_values()
summary_values_processed = [remove_emoji(entry) for entry in summary_values]
replace_summary_values(summary_values_processed)

# process text values
text_values = get_text_values()
text_values_processed = [remove_emoji(entry) for entry in text_values]
replace_text_values(text_values_processed)



In [None]:
# view results
df_products[:10]

Id,Summary,Text,Score
1,Good Quality Dog Food,I have bought several of the Vitality canned d...,5
2,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...,1
3,Delight says it all,This is a confection that has been around a fe...,4
4,Cough Medicine,If you are looking for the secret ingredient i...,2
5,Great taffy,Great taffy at a great price. There was a wid...,5
6,Nice Taffy,I got a wild hair for taffy and ordered this f...,4
7,Great Just as good as the expensive brands,This saltwater taffy had great flavors and was...,5
8,Wonderful,tasty taffyThis taffy is so good. It is very...,5
9,Yay Barley,Right now I'm mostly just sprouting this so my...,5
10,Healthy Dog Food,This is a very healthy dog food. Good for thei...,5


## 3.4 Spelling correction

This step applies the `.correct()` function of the `TextBlob` library onto all string entries of the dataset.<br>

Again, we'll first define a utility function and execute it afterwards.

In [None]:
def correct_spelling(line_to_correct):
  text_blob = TextBlob(line_to_correct)
  return str(text_blob.correct())

text = "We hav founf something nice"
print("original:\t" + text)
print("corrected:\t" + correct_spelling(text))

original:	We hav founf something nice
corrected:	He had found something nice


The spell checker employed here would take more than 10 hours to process our entire data set. Therefore, we've decided not to run it on the full data, at least for now.

As a proof of concept, only the first 10 entries are processed.

In [None]:
# process text values (with progress bar)
text_values = get_text_values()
text_values_processed = []
text_values_to_process = 10 # len(text_values)

with tqdm.tqdm(total=len(text_values), desc="Processing of text values") as pbar:
  for text in text_values:
    if text_values_to_process > 0:
      text_values_processed.append(correct_spelling(str(text)))
      text_values_to_process = text_values_to_process - 1
    else:
      text_values_processed.append(text)
    pbar.update(1)

replace_text_values(text_values_processed)

# process summary values (with progress bar)
summary_values = get_summary_values()
summary_values_processed = []
summary_values_to_process = 10 # len(summary_values)

with tqdm.tqdm(total=len(summary_values), desc="Processing of summary values") as pbar:
  for summary in summary_values:
    if summary_values_to_process > 0:
      summary_values_processed.append(correct_spelling(str(summary)))
      summary_values_to_process = summary_values_to_process - 1
    else:
      summary_values_processed.append(summary)
    pbar.update(1)

replace_summary_values(summary_values_processed)

output.clear()
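The "first 10 entries only" logic above can be factored into one helper that takes the processing function as a parameter. This is a sketch with a hypothetical name (`process_first_n`); `str.upper` stands in for `correct_spelling` so that the example runs without `textblob`.

```python
def process_first_n(values, process, n):
    # apply `process` to the first n entries and keep the remaining entries unchanged
    return [process(str(value)) if index < n else value
            for index, value in enumerate(values)]

values = ["gret taffy", "nice taffy", "yay barley"]
result = process_first_n(values, str.upper, 2)
print(result)  # ['GRET TAFFY', 'NICE TAFFY', 'yay barley']
```

Passing `len(values)` as `n` would process the whole column, which corresponds to the commented-out `len(...)` values in the cell above.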

In [None]:
# view results
df_products[:10]

Id,Summary,Text,Score
1,Good Quality Dog Good,I have bought several of the Vitality canned d...,5
2,Not as Advertised,Product arrived labelled as Lumbo Halted Peanu...,1
3,Delight says it all,His is a connection that has been around a few...,4
4,Rough Medicine,Of you are looking for the secret ingredient i...,2
5,Great staff,Great staff at a great price. There was a wid...,5
6,Vice Puffy,I got a wild hair for staff and ordered this f...,4
7,Great Must as good as the expensive bands,His saltpeter staff had great favors and was v...,5
8,Wonderful,taste taffyThis staff is so good. It is very...,5
9,May Barley,Right now I'm mostly just sprouting this so my...,5
10,Healthy Dog Good,His is a very healthy dog food. Good for their...,5


## 3.5 Conversion to lower case

This step can be performed with the built-in function of the Python string class `.lower()`.

In [None]:
# process summary values
summary_values = get_summary_values()
summary_values_processed = [entry.lower() for entry in summary_values]
replace_summary_values(summary_values_processed)

# process text values
text_values = get_text_values()
text_values_processed = [entry.lower() for entry in text_values]
replace_text_values(text_values_processed)



In [None]:
# view results
df_products[:10]

Id,Summary,Text,Score
1,good quality dog good,i have bought several of the vitality canned d...,5
2,not as advertised,product arrived labelled as lumbo halted peanu...,1
3,delight says it all,his is a connection that has been around a few...,4
4,rough medicine,of you are looking for the secret ingredient i...,2
5,great staff,great staff at a great price. there was a wid...,5
6,vice puffy,i got a wild hair for staff and ordered this f...,4
7,great must as good as the expensive bands,his saltpeter staff had great favors and was v...,5
8,wonderful,taste taffythis staff is so good. it is very...,5
9,may barley,right now i'm mostly just sprouting this so my...,5
10,healthy dog good,his is a very healthy dog food. good for their...,5


## 3.6 Removal of repeated words

This step relies on a regular expression that matches a word immediately followed by a repetition of itself. Each substitution pass removes one duplicate from every run of repeated words, so the substitution is executed until no duplicates are left.

`Paris in the the the spring.` --> `Paris in the the spring.` --> `Paris in the spring.`

In [None]:
# compiled once: a word that is directly followed by itself
word_duplicate = re.compile(r'\b(\w+)\s+\1\b')

def remove_word_duplicates(line):
  # one pass removes one duplicate from every run of repeated words
  output_line = word_duplicate.sub(r'\1', line)

  if output_line != line: # term was removed ^= we need to check for further duplicates
    return remove_word_duplicates(output_line)
  else:
    return output_line

# print(remove_word_duplicates('but but i cant see see a a a sea here here.'))
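The pass-by-pass behaviour can be traced with a small standalone sketch. `collapse_iteratively` is a hypothetical helper that records every intermediate line, and `run` is an assumed variant of the pattern that matches a whole run of repetitions, so a single substitution is enough.

```python
import re

pair = re.compile(r'\b(\w+)\s+\1\b')      # pattern used above: one adjacent pair
run = re.compile(r'\b(\w+)(?:\s+\1\b)+')  # assumption: variant matching a whole run

def collapse_iteratively(line):
    # repeat the pairwise substitution until the line stops changing
    passes = [line]
    while True:
        reduced = pair.sub(r'\1', passes[-1])
        if reduced == passes[-1]:
            return passes
        passes.append(reduced)

trace = collapse_iteratively("Paris in the the the spring.")
for step in trace:
    print(step)

single_pass = run.sub(r'\1', "Paris in the the the spring.")
print(single_pass == trace[-1])
```

Both patterns are case-sensitive, which is one reason this step runs after the conversion to lower case.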

In [None]:
# process summary values
summary_values = get_summary_values()
summary_values_processed = [remove_word_duplicates(entry) for entry in summary_values]
replace_summary_values(summary_values_processed)

# process text values
text_values = get_text_values()
text_values_processed = [remove_word_duplicates(entry) for entry in text_values]
replace_text_values(text_values_processed)



In [None]:
# view results
df_products[:10]

Id,Summary,Text,Score
1,good quality dog good,i have bought several of the vitality canned d...,5
2,not as advertised,product arrived labelled as lumbo halted peanu...,1
3,delight says it all,his is a connection that has been around a few...,4
4,rough medicine,of you are looking for the secret ingredient i...,2
5,great staff,great staff at a great price. there was a wid...,5
6,vice puffy,i got a wild hair for staff and ordered this f...,4
7,great must as good as the expensive bands,his saltpeter staff had great favors and was v...,5
8,wonderful,taste taffythis staff is so good. it is very...,5
9,may barley,right now i'm mostly just sprouting this so my...,5
10,healthy dog good,his is a very healthy dog food. good for their...,5


## 3.7 Expansion of word contractions

In [None]:
contractions=[ # extended list extracted from 'contractions.csv' (https://www.kaggle.com/datasets/ishivinal/contractions?resource=download)
    (r'\'aight', 'alright'),
    (r'ain\'t', 'is not'),
    (r'amn\'t', 'am not'),
    (r'aren\'t', 'are not'),
    (r'can\'t', 'cannot'),
    (r'\'cause', 'because'),
    (r'could\'ve', 'could have'),
    (r'couldn\'t', 'could not'),
    (r'couldn\'t\'ve', 'could not have'),
    (r'daren\'t', 'dare not'),
    (r'daresn\'t', 'dare not'),
    (r'dasn\'t', 'dare not'),
    (r'didn\'t', 'did not'),
    (r'doesn\'t', 'does not'),
    (r'don\'t', 'do not'),
    (r'dunno', 'do not know'),
    (r'd\'ye', 'do you'),
    (r'e\'er', 'ever'),
    (r'everybody\'s', 'everybody is'),
    (r'everyone\'s', 'everyone is'),
    (r'finna', 'fixing'),
    (r'g\'day', 'good day'),
    (r'gimme', 'give me'),
    (r'giv\'n', 'given'),
    (r'gonna', 'going to'),
    (r'gon\'t', 'go not'),
    (r'gotta', 'got to'),
    (r'hadn\'t', 'had not'),
    (r'had\'ve', 'had have'),
    (r'hasn\'t', 'has not'),
    (r'haven\'t', 'have not'),
    (r'he\'d', 'he had'),
    (r'he\'ll', 'he will'),
    (r'he\'s', 'he is'),
    (r'he\'ve', 'he have'),
    (r'how\'d', 'how did'),
    (r'howdy', 'how do you do'),
    (r'how\'ll', 'how will'),
    (r'how\'re', 'how are'),
    (r'how\'s', 'how is'),
    (r'i\'d', 'i had'),
    (r'i\'d\'ve', 'i would have'),
    (r'i\'ll', 'i will'),
    (r'i\'m', 'i am'),
    (r'i\'m\'a', 'i am about to'),
    (r'i\'m\'o', 'i am going to'),
    (r'innit', 'is it not'),
    (r'i\'ve', 'i have'),
    (r'isn\'t', 'is not'),
    (r'it\'d', 'it would'),
    (r'it\'ll', 'it will'),
    (r'it\'s', 'it is'),
    (r'iunno', 'i do not know'),
    (r'let\'s', 'let us'),
    (r'ma\'am', 'madam'),
    (r'mayn\'t', 'may not'),
    (r'may\'ve', 'may have'),
    (r'methinks', 'me thinks'),
    (r'mightn\'t', 'might not'),
    (r'might\'ve', 'might have'),
    (r'mustn\'t', 'must not'),
    (r'mustn\'t\'ve', 'must not have'),
    (r'must\'ve', 'must have'),
    (r'needn\'t', 'need not'),
    (r'ne\'er', 'never'),
    (r'o\'clock', 'of the clock'),
    (r'o\'er', 'over'),
    (r'ol\'', 'old'),
    (r'oughtn\'t', 'ought not'),
    (r'\'s', ' is'), # leading space keeps the expansion separate from the preceding word
    (r'shalln\'t', 'shall not'),
    (r'shan\'t', 'shall not'),
    (r'she\'d', 'she had'),
    (r'she\'ll', 'she shall'),
    (r'she\'s', 'she has'),
    (r'should\'ve', 'should have'),
    (r'shouldn\'t', 'should not'),
    (r'shouldn\'t\'ve', 'should not have'),
    (r'somebody\'s', 'somebody is'),
    (r'someone\'s', 'someone is'),
    (r'something\'s', 'something is'),
    (r'so\'re', 'so are'),
    (r'that\'ll', 'that will'),
    (r'that\'re', 'that are'),
    (r'that\'s', 'that is'),
    (r'that\'d', 'that would'),
    (r'there\'d', 'there would'),
    (r'there\'ll', 'there will'),
    (r'there\'re', 'there are'),
    (r'there\'s', 'there is'),
    (r'these\'re', 'these are'),
    (r'these\'ve', 'these have'),
    (r'they\'d', 'they would'),
    (r'they\'ll', 'they will'),
    (r'they\'re', 'they are'),
    (r'they\'ve', 'they have'),
    (r'this\'s', 'this is'),
    (r'those\'re', 'those are'),
    (r'those\'ve', 'those have'),
    (r'\'tis', 'it is'),
    (r'to\'ve', 'to have'),
    (r'\'twas', 'it was'),
    (r'wanna', 'want to'),
    (r'wasn\'t', 'was not'),
    (r'we\'d', 'we would'),
    (r'we\'d\'ve', 'we would have'),
    (r'we\'ll', 'we will'),
    (r'we\'re', 'we are'),
    (r'we\'ve', 'we have'),
    (r'weren\'t', 'were not'),
    (r'what\'d', 'what did'),
    (r'what\'ll', 'what will'),
    (r'what\'re', 'what are'),
    (r'what\'s', 'what is'),
    (r'what\'ve', 'what have'),
    (r'when\'s', 'when is'),
    (r'where\'d', 'where did'),
    (r'where\'ll', 'where will'),
    (r'where\'re', 'where are'),
    (r'where\'s', 'where is'),
    (r'where\'ve', 'where have'),
    (r'which\'d', 'which would'),
    (r'which\'ll', 'which will'),
    (r'which\'re', 'which are'),
    (r'which\'s', 'which is'),
    (r'which\'ve', 'which have'),
    (r'who\'d', 'who would'),
    (r'who\'d\'ve', 'who would have'),
    (r'who\'ll', 'who will'),
    (r'who\'re', 'who are'),
    (r'who\'s', 'who is'),
    (r'who\'ve', 'who have'),
    (r'why\'d', 'why did'),
    (r'why\'re', 'why are'),
    (r'why\'s', 'why is'),
    (r'willn\'t', 'will not'),
    (r'won\'t', 'will not'),
    (r'wonnot', 'will not'),
    (r'would\'ve', 'would have'),
    (r'wouldn\'t', 'would not'),
    (r'wouldn\'t\'ve', 'would not have'),
    (r'y\'all', 'you all'),
    (r'y\'all\'d\'ve', 'you all would have'),
    (r'y\'all\'re', 'you all are'),
    (r'you\'d', 'you would'),
    (r'you\'ll', 'you will'),
    (r'you\'re', 'you are'),
    (r'you\'ve', 'you have')
]

contractions_regex = [(re.compile(contracted_word), uncontracted_word) for (contracted_word, uncontracted_word) in contractions]

# expands all occurrences of the first contraction pattern that matches
def remove_word_contraction(line):
  # try each known contraction in list order
  for (contracted_word, uncontracted_word) in contractions_regex: 
    (s, count) = contracted_word.subn(uncontracted_word, line)
    # return result if substitution was successful
    if count > 0: 
      return s
  # no substitution was necessary
  return line

# expands all contractions
def remove_word_contractions(line):
  output_line = remove_word_contraction(line)
  # a contraction was expanded ^= we need to check for further contractions
  if (line != output_line):  
    return remove_word_contractions(output_line)
  # nothing changed ^= no contractions left
  else:
    return output_line
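A possible alternative, sketched here under assumptions rather than taken from this notebook, is to combine the contractions into a single pattern with word boundaries and to look up each match in a dictionary. `SAMPLE_CONTRACTIONS` is a small illustrative subset of the list above and `expand_contractions` is a hypothetical name. Sorting the keys by length puts `couldn't've` before `couldn't`, so the longer form wins, and one pass over the line is enough.

```python
import re

# illustrative subset of the contraction list
SAMPLE_CONTRACTIONS = {
    "couldn't've": "could not have",
    "couldn't": "could not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
}

# longest keys first, so that longer contractions are preferred over their prefixes
_alternatives = sorted(SAMPLE_CONTRACTIONS, key=len, reverse=True)
_pattern = re.compile(r"\b(?:" + "|".join(re.escape(key) for key in _alternatives) + r")\b")

def expand_contractions(line):
    # replace every match with its expansion from the dictionary
    return _pattern.sub(lambda match: SAMPLE_CONTRACTIONS[match.group(0)], line)

print(expand_contractions("i'm sure it's fine, don't worry"))
print(expand_contractions("she couldn't've known"))
```

The word boundaries also prevent matches inside longer words, at the cost of not covering patterns that start with an apostrophe, such as `'cause`.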

In [None]:
# process summary values
summary_values = get_summary_values()
summary_values_processed = [remove_word_contractions(entry) for entry in tqdm.tqdm(summary_values, leave=False)]
replace_summary_values(summary_values_processed)

# process text values
text_values = get_text_values()
text_values_processed = [remove_word_contractions(entry) for entry in tqdm.tqdm(text_values)]
replace_text_values(text_values_processed)

output.clear()

In [None]:
# view results
df_products[:10]

Id,Summary,Text,Score
1,good quality dog good,i have bought several of the vitality canned d...,5
2,not as advertised,product arrived labelled as lumbo halted peanu...,1
3,delight says it all,his is a connection that has been around a few...,4
4,rough medicine,of you are looking for the secret ingredient i...,2
5,great staff,great staff at a great price. there was a wid...,5
6,vice puffy,i got a wild hair for staff and ordered this f...,4
7,great must as good as the expensive bands,his saltpeter staff had great favors and was v...,5
8,wonderful,taste taffythis staff is so good. it is very...,5
9,may barley,right now i am mostly just sprouting this so m...,5
10,healthy dog good,his is a very healthy dog food. good for their...,5


## 3.8 Lemmatization of all terms

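For the lemmatization process, we'll be using the `WordNetLemmatizer` from `nltk`. For each word passed to the lemmatizer, its lemma (the dictionary base form) is returned. Since no part-of-speech tag is passed, every word is treated as a noun, which is why words such as `as`, `was` and `has` appear as `a`, `wa` and `ha` in the results below. To preserve performance, we'll initialize the lemmatizer object only once.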

In [None]:
def lemmatize_words_in_line(lemmatizer_instance, line):
  # use the lemmatizer that was passed in instead of relying on the global variable
  lemmatized_words = [lemmatizer_instance.lemmatize(word) for word in line.split(" ")]
  return " ".join(lemmatized_words)

text = "well the years start coming"

lemmatizer = WordNetLemmatizer()
print(lemmatize_words_in_line(lemmatizer, text))

well the year start coming


In [None]:
lemmatizer = WordNetLemmatizer()

# process summary values
summary_values = get_summary_values()
summary_values_processed = [lemmatize_words_in_line(lemmatizer, str(entry)) for entry in tqdm.tqdm(summary_values)]
replace_summary_values(summary_values_processed)

# process text values
text_values = get_text_values()
text_values_processed = [str(lemmatize_words_in_line(lemmatizer, str(entry))) for entry in tqdm.tqdm(text_values)]
replace_text_values(text_values_processed)

output.clear()

In [None]:
# view results
df_products[:10]

Id,Summary,Text,Score
1,good quality dog good,i have bought several of the vitality canned d...,5
2,not a advertised,product arrived labelled a lumbo halted peanut...,1
3,delight say it all,his is a connection that ha been around a few ...,4
4,rough medicine,of you are looking for the secret ingredient i...,2
5,great staff,great staff at a great price. there wa a wide...,5
6,vice puffy,i got a wild hair for staff and ordered this f...,4
7,great must a good a the expensive band,his saltpeter staff had great favor and wa ver...,5
8,wonderful,taste taffythis staff is so good. it is very...,5
9,may barley,right now i am mostly just sprouting this so m...,5
10,healthy dog good,his is a very healthy dog food. good for their...,5


# 4. Usage in upcoming colab files

In this section, the preprocessed dataframe is exported to a CSV file so that it's easy to work with in tasks 2-4. Furthermore, an import statement is provided to offer "out-of-the-box" functionality within upcoming colab documents.

In [None]:
# export to CSV
df_products.to_csv("products_preprocessed.csv", index=True, header = True, sep=',')

In [None]:
# import from CSV
!wget https://raw.githubusercontent.com/schmidt-marvin/ESI_2022_TecAA/main/task03/intermediate_files/products_preprocessed.csv
output.clear()

df_products_preprocessed = pd.read_csv("products_preprocessed.csv", sep=",", index_col="Id")
df_products_preprocessed.head()

Id,Summary,Text,Score
1,good quality dog good,i have bought several of the vitality canned d...,5
2,not a advertised,product arrived labelled a lumbo halted peanut...,1
3,delight say it all,his is a connection that ha been around a few ...,4
4,rough medicine,of you are looking for the secret ingredient i...,2
5,great staff,great staff at a great price. there wa a wide...,5
