<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-research-and-practice/blob/main/natural-language-processing-with-pytorch/03-neural-networks-foundational/01_dataset_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Yelp Review Dataset Preprocessing

## Setup

In [1]:
import collections
import numpy as np
import pandas as pd
import re

from argparse import Namespace

In [None]:
from google.colab import files
files.upload()  # upload your kaggle.json API credentials file

In [3]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 ~/.kaggle/kaggle.json

# Download the dataset from Kaggle. URL: https://www.kaggle.com/hhalalwi/yelp-light?select=raw_test.csv
kaggle datasets download -d hhalalwi/yelp-light
unzip -qq yelp-light.zip

kaggle.json
Downloading yelp-light.zip to /content
 95% 167M/176M [00:01<00:00, 102MB/s]
100% 176M/176M [00:01<00:00, 113MB/s]




## Dataset preprocessing

In [4]:
args = Namespace(
    raw_train_dataset_csv="raw_train.csv",
    raw_test_dataset_csv="raw_test.csv",
    train_proportion=0.7,
    val_proportion=0.3,
    output_munged_csv="reviews_with_splits_full.csv",
    seed=1337
)

In [5]:
# Read raw data
train_reviews = pd.read_csv(args.raw_train_dataset_csv, header=None, names=['rating', 'review'])
train_reviews = train_reviews[~pd.isnull(train_reviews.review)]
test_reviews = pd.read_csv(args.raw_test_dataset_csv, header=None, names=['rating', 'review'])
test_reviews = test_reviews[~pd.isnull(test_reviews.review)]

In [6]:
train_reviews.head()

,rating,review
0,1,"Unfortunately, the frustration of being Dr. Go..."
1,2,Been going to Dr. Goldberg for over 10 years. ...
2,1,I don't know what Dr. Goldberg was like before...
3,1,I'm writing this review to give you a heads up...
4,2,All the food is great here. But the best thing...


In [7]:
test_reviews.head()

,rating,review
0,1,Ordered a large Mango-Pineapple smoothie. Stay...
1,2,Quite a surprise! \n\nMy wife and I loved thi...
2,1,"First I will say, this is a nice atmosphere an..."
3,2,I was overall pretty impressed by this hotel. ...
4,1,Video link at bottom review. Worst service I h...


In [8]:
# Unique classes
set(train_reviews.rating)

{1, 2}

In [9]:
# Group the training rows by rating
by_rating = collections.defaultdict(list)
for _, row in train_reviews.iterrows():
  by_rating[row.rating].append(row.to_dict())  # store each row as a plain dict
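
Iterating row by row with `iterrows` can be slow on a few hundred thousand reviews. A minimal sketch of an equivalent grouping built with `DataFrame.groupby` and `to_dict("records")`, shown on a small made-up frame (`toy_reviews` and `by_rating_toy` are hypothetical names, not part of the notebook):

```python
import collections
import pandas as pd

# Hypothetical toy data standing in for train_reviews
toy_reviews = pd.DataFrame({
    "rating": [1, 2, 1, 2],
    "review": ["bad", "good", "awful", "great"],
})

# Group rows by rating; each group becomes a list of row dicts
by_rating_toy = collections.defaultdict(list)
for rating, group in toy_reviews.groupby("rating"):
    by_rating_toy[rating] = group.to_dict("records")

print({rating: len(rows) for rating, rows in by_rating_toy.items()})
```

Each value has the same shape as in the loop above: a list of dicts with `rating` and `review` keys.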

In [13]:
# Assign train/val splits within each rating class so both splits keep the class balance
final_list = []
np.random.seed(args.seed)

for _, item_list in sorted(by_rating.items()):
  np.random.shuffle(item_list)

  n_total = len(item_list)
  n_train = int(args.train_proportion * n_total)
  n_val = int(args.val_proportion * n_total)

  # Tag each data point with the split it belongs to
  for item in item_list[:n_train]:
    item["split"] = "train"

  for item in item_list[n_train: n_train + n_val]:
    item["split"] = "val"

  # Add to final list
  final_list.extend(item_list)
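
Both `n_train` and `n_val` are truncated with `int`, so whenever the proportions do not divide a class size evenly, the rows past `n_train + n_val` would end up without a `split` value. A minimal sketch of one way to guard against that, assuming it is acceptable to give the validation split whatever remains after the training slice. The helper `assign_splits` and the toy data are hypothetical, and the sketch uses `np.random.default_rng` rather than the global `np.random.seed` used in the cell above:

```python
import numpy as np

def assign_splits(items, train_proportion, seed):
    """Shuffle a list of row dicts and tag each one as 'train' or 'val'."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(items))
    shuffled = [dict(items[i]) for i in order]  # copy so the inputs are untouched

    n_train = int(train_proportion * len(shuffled))
    for position, item in enumerate(shuffled):
        # Everything past the training slice goes to validation,
        # so no row is left without a split.
        item["split"] = "train" if position < n_train else "val"
    return shuffled

# Hypothetical toy class of 7 reviews; 0.7 * 7 is not a whole number
toy_items = [{"rating": 1, "review": f"review {i}"} for i in range(7)]
tagged = assign_splits(toy_items, train_proportion=0.7, seed=1337)
print([item["split"] for item in tagged])
```

With 7 items and a 0.7 training proportion, 4 rows are tagged `train` and the remaining 3 are tagged `val`.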

In [16]:
final_list[0]["rating"], final_list[0]["review"], final_list[0]["split"]

(1,
 "The entrance was the #1 impressive thing about this place, as it is completely a surprise and almost shocks you.  I won't give it up here but it's worth at least getting a drink to experience that part.\\n\\nGreeter was great.\\n\\nLounge singer was very 70s and campy, but liked him none the less.\\n\\n\\nFood is pretty below average in pretty much every way possible.\\n\\nBread they bring out is pretty bad and dries out within minutes of being at the table to the point it's inedible.\\n\\nSalads were small, boring and WAY drowned in dressing.\\n\\nCalamari is $15 or so, and is half the size of what they'd give you at Capitol Grille, and is pretty bland and mushy.\\n\\nthe pasta tasted as though it were cooked in rancid water.  the chef was nice enough to come out and bring my wife a personalized dish that he eats, and it tasted the same.  we smiled and pretended to enjoy it just to keep from looking like big complainers.  We literally don't complain usually, but her first dish w

In [17]:
# Tag every test-set row with the "test" split and add it to the final list
for _, row in test_reviews.iterrows():
  row_dict = row.to_dict()
  row_dict["split"] = "test"
  final_list.append(row_dict)

In [18]:
# Collect all rows (train, val and test) into a single DataFrame
final_reviews = pd.DataFrame(final_list)

In [19]:
final_reviews.split.value_counts()

train    392000
val      168000
test      38000
Name: split, dtype: int64

In [20]:
final_reviews.review.head()

0    The entrance was the #1 impressive thing about...
1    I'm a Mclover, and I had no problem\nwith the ...
2    Less than good here, not terrible, but I see n...
3    I don't know if I can ever bring myself to go ...
4    Food was OK/Good but the service was terrible....
Name: review, dtype: object

In [21]:
final_reviews[pd.isnull(final_reviews.review)]

,rating,review,split


In [22]:
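# Preprocess the reviews
def preprocess_text(text):
  # Null reviews were filtered out earlier; report and skip any stray non-string value
  if not isinstance(text, str):
    print(text)
    return ""
  text = text.lower()
  # Put a space before each punctuation mark so it becomes its own token
  text = re.sub(r"([.,!?])", r" \1", text)
  # Replace every run of characters other than letters and . , ! ? with a single space
  text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)
  return text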
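
To see what the two substitutions do, here is a small self-contained check on a made-up sentence. The first pattern puts a space before each of `. , ! ?`; the second collapses every run of other non-letter characters (digits, apostrophes, backslashes, whitespace) into a single space. The function is restated without the type check so the snippet runs by itself, and `sample` is a hypothetical input that contains a literal backslash followed by `n`, as the raw reviews do:

```python
import re

def preprocess_text(text):
    text = text.lower()
    text = re.sub(r"([.,!?])", r" \1", text)      # space before punctuation
    text = re.sub(r"[^a-zA-Z.,!?]+", r" ", text)  # other characters become one space
    return text

# Raw string: the \n here is two characters (backslash + n), not a newline
sample = r"I don't know, it's #1!\nGreat."

print(preprocess_text(sample))
# i don t know , it s ! ngreat .

# Assumption: replacing the literal "\n" with a space first avoids the stray "n"
print(preprocess_text(sample.replace("\\n", " ")))
# i don t know , it s ! great .
```

The stray `n` in the first result is the same effect that produces `nwith` in the processed reviews: the backslash is replaced by a space, and the `n` that followed it stays attached to the next word.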

In [23]:
final_reviews.review = final_reviews.review.apply(preprocess_text)

In [24]:
final_reviews.head()

,rating,review,split
0,1,the entrance was the impressive thing about th...,train
1,1,"i m a mclover , and i had no problem nwith the...",train
2,1,"less than good here , not terrible , but i see...",train
3,1,i don t know if i can ever bring myself to go ...,train
4,1,food was ok good but the service was terrible ...,train


In [30]:
# Unique rating values
list(set(final_reviews.rating))

[1, 2]

In [31]:
# Map the numeric ratings to sentiment labels
final_reviews["rating"] = final_reviews.rating.apply({1: "negative", 2: "positive"}.get)
final_reviews.head()

,rating,review,split
0,negative,the entrance was the impressive thing about th...,train
1,negative,"i m a mclover , and i had no problem nwith the...",train
2,negative,"less than good here , not terrible , but i see...",train
3,negative,i don t know if i can ever bring myself to go ...,train
4,negative,food was ok good but the service was terrible ...,train
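
Using `apply` with `dict.get` silently turns any rating missing from the dictionary into `None`. A minimal sketch of the same mapping with `Series.map`, plus an explicit check for unmapped values; `toy` and `label_map` are hypothetical names used only for illustration:

```python
import pandas as pd

label_map = {1: "negative", 2: "positive"}
toy = pd.DataFrame({"rating": [1, 2, 2, 1]})

labels = toy["rating"].map(label_map)
# map() yields NaN for keys absent from label_map, so this check catches surprises
assert labels.notna().all(), "found ratings without a label"
toy["rating"] = labels

print(toy["rating"].tolist())
# ['negative', 'positive', 'positive', 'negative']
```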


In [32]:
# Finally, save the preprocessed dataset
final_reviews.to_csv(args.output_munged_csv, index=False)
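
One way to confirm that a file like this was written as intended is to read it back and inspect it. A minimal sketch on a toy frame, writing to a temporary directory rather than to `args.output_munged_csv`; the frame contents and the file name `toy_reviews.csv` are hypothetical:

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({
    "rating": ["negative", "positive"],
    "review": ["food was bad .", "great service !"],
    "split": ["train", "val"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_reviews.csv")
    toy.to_csv(path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(reloaded["split"].value_counts().to_dict())
```

Passing `index=False` matters here: without it, the row index is written as an unnamed first column and shows up as `Unnamed: 0` when the file is read back.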