## Custom Dataset Generation

For custom dataset generation, I used the food-related "Amazon Fine Food Reviews" dataset from Kaggle (https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews).

First, I created a list of common foods (18 in total) and a list of helper words that are likely to appear in sentences about food. For example, a sentence about food will often contain words such as tasty, delicious, or savoury.

In [1]:
import pandas as pd
import numpy as np

In [2]:
ENTITY_NAMES = ['gravy','ramen','burger','pizza','pasta','wings',
                'coke','sprite','water','fanta','pepsi','seven up', 
                'biriyani','rice', 'pulao', 
                'bread', 'flat bread', 'rice bowl']
helper_words = ['flavour','flavours','tasty','delicious','juicy','spicy','soft','chewy']

Duplicate reviews were removed and the review text was converted to lowercase.

In [4]:
sentence_df = pd.read_csv("Reviews.csv")
sentence_df = sentence_df.drop_duplicates("Text")
sentence_df["Text"] = sentence_df["Text"].str.lower()

In [5]:
# The helper-word mask does not depend on the food, so compute it once
helper_mask = sentence_df['Text'].str.contains('|'.join(helper_words))
for food in ENTITY_NAMES:
    food_mask = sentence_df['Text'].str.contains(food)
    print("Total # of reviews with the word '{}' in them: ".format(food), food_mask.sum())
    print("Total # of reviews with the word '{}' and helper words in them: ".format(food), (food_mask & helper_mask).sum())

Total # of reviews with the word 'gravy' in them:  1402
Total # of reviews with the word 'gravy' and helper words in them:  255
Total # of reviews with the word 'ramen' in them:  1078
Total # of reviews with the word 'ramen' and helper words in them:  453
Total # of reviews with the word 'burger' in them:  1789
Total # of reviews with the word 'burger' and helper words in them:  469
Total # of reviews with the word 'pizza' in them:  1945
Total # of reviews with the word 'pizza' and helper words in them:  472
Total # of reviews with the word 'pasta' in them:  5970
Total # of reviews with the word 'pasta' and helper words in them:  1319
Total # of reviews with the word 'wings' in them:  454
Total # of reviews with the word 'wings' and helper words in them:  140
Total # of reviews with the word 'coke' in them:  887
Total # of reviews with the word 'coke' and helper words in them:  169
Total # of reviews with the word 'sprite' in them:  149
Total # of reviews with the word 'sprite' and hel
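
The counts above come from plain substring matching, so a short name such as 'rice' also matches inside longer words such as 'price'. The sketch below contrasts substring matching with a whole-word regular expression on a few made-up sentences. The helper `contains_word` and the toy sentences are illustrative assumptions, not part of the original notebook, which uses substring matching throughout.

```python
import re

def contains_word(text, word):
    """Return True if `word` occurs in `text` as a whole word (hypothetical helper)."""
    # \b marks a word boundary; re.escape guards against regex metacharacters
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text) is not None

# Made-up sentences for illustration only
toy_reviews = [
    "the price was fine",
    "fried rice is tasty",
    "great rice bowl",
]

# Substring matching, as in the counts above
substring_hits = ["rice" in text for text in toy_reviews]
# Whole-word matching, shown here only as an alternative
whole_word_hits = [contains_word(text, "rice") for text in toy_reviews]

print(substring_hits)
print(whole_word_hits)
```

With pandas, the same pattern could be passed to `Series.str.contains`, which treats its argument as a regular expression by default.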

For every food, the first 500 reviews containing it were taken and split into lines on full stops. A line was written to the file "food_sentences.txt" only if it contained a food and a helper word. Some minor text formatting was done, such as removing HTML line-break tags (<br />). Also, for each food, lines that mention any food appearing earlier in the list were discarded to avoid repetition.

In [6]:
take_first = 500  # number of reviews to take per food
save_file_name = 'food_sentences.txt'
with open(save_file_name,'w') as file:
    for food_no,food in enumerate(ENTITY_NAMES):
        # first `take_first` reviews that mention the current food
        for sentence in sentence_df['Text'][sentence_df['Text'].str.contains(food)].iloc[:take_first]:
            for line in sentence.split('.'):
                # keep lines with a food and a helper word, but no food from earlier in the list
                if (any(x in line for x in ENTITY_NAMES)
                        and all(x not in line for x in ENTITY_NAMES[:food_no])
                        and any(x in line for x in helper_words)):
                    file.write(line.strip().replace('<br />','')+'\n')
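
To make the line filter easier to follow, the sketch below pulls the same condition into a small function and runs it on one made-up review. The function name `keep_line`, the two-item entity list, and the toy review are assumptions for illustration. The condition itself mirrors the cell above.

```python
def keep_line(line, food_no, entity_names, helper_words):
    """Same rule as the cell above: a food, a helper word, and no earlier food."""
    has_food = any(name in line for name in entity_names)
    has_earlier_food = any(name in line for name in entity_names[:food_no])
    has_helper = any(word in line for word in helper_words)
    return has_food and not has_earlier_food and has_helper

# Toy inputs, not taken from the dataset
toy_entities = ['burger', 'pizza']
toy_helpers = ['tasty', 'juicy']
review = ("the burger was juicy. the pizza was tasty<br />. "
          "the burger and pizza were tasty. delivery was slow")

def filter_review(food_no):
    # Split on full stops, filter, then clean up exactly as the cell above does
    return [line.strip().replace('<br />', '')
            for line in review.split('.')
            if keep_line(line, food_no, toy_entities, toy_helpers)]

kept_for_burger = filter_review(0)  # no earlier foods to exclude
kept_for_pizza = filter_review(1)   # lines mentioning 'burger' are excluded

print(kept_for_burger)
print(kept_for_pizza)
```

On this toy review, the pass for 'pizza' keeps only the line that mentions pizza without burger, while the pass for 'burger' keeps every line that has any food and a helper word, because the condition checks all entity names and not only the current one.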

The sentences saved in the file were then manually annotated using an NER annotator tool (https://tecoholic.github.io/ner-annotator/), and the annotations were saved in the file "annotations.json".
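
As a rough sketch of how such annotations could be read back, the code below parses a small JSON string into (text, spans) pairs. It assumes the export has a top-level "annotations" list whose entries are [text, {"entities": [[start, end, label], ...]}]; that layout, the label "FOOD", the sample string, and the function name `to_examples` are all assumptions, so check them against the actual "annotations.json" before relying on this.

```python
import json

def to_examples(annotation_data):
    """Convert the assumed export layout into a list of (text, spans) pairs."""
    examples = []
    for entry in annotation_data["annotations"]:
        if entry is None:  # defensively skip empty entries
            continue
        text, meta = entry
        spans = [(start, end, label) for start, end, label in meta["entities"]]
        examples.append((text, spans))
    return examples

# Hand-written sample in the assumed layout; "FOOD" is a hypothetical label
sample_json = (
    '{"classes": ["FOOD"], "annotations": '
    '[["the pizza was tasty", {"entities": [[4, 9, "FOOD"]]}]]}'
)

examples = to_examples(json.loads(sample_json))
text, spans = examples[0]
start, end, label = spans[0]
print(text[start:end], label)
```

For the real file, `json.load` on an open handle to "annotations.json" would replace the `json.loads(sample_json)` call.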