This notebook downloads all Roman Urdu datasets from Hugging Face and Kaggle, then merges them into a single dataset with unified formatting. The labels/numbering used in the file correspond to the numbering used in the Roman Urdu doc: https://docs.google.com/document/d/10Lm92pXTbTkKz2ksimhKUSXIh1vfLqCHtJ6gwsqUVGg/edit?usp=drivesdk

## Installing Kaggle for downloading Kaggle datasets
## Note: Use this link for the Kaggle API file: https://www.kaggle.com/general/74235
This is the easiest way to download Kaggle data in Google Colab.
When you run the cell below, you will be asked to upload a file.
Choose the kaggle.json file that you downloaded.

In [None]:
# imports for kaggle
!pip install -q kaggle
from google.colab import files
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"shafaqfatimamughal","key":"<redacted>"}'}

In [None]:
! mkdir -p ~/.kaggle # make the directory (-p avoids an error if it already exists)
! cp kaggle.json ~/.kaggle/ # copy the kaggle.json file there
! chmod 600 ~/.kaggle/kaggle.json # restrict the file to owner read/write only
! kaggle datasets list # run this command to check that everything is okay

ref                                                                title                                                size  lastUpdated          downloadCount  voteCount  usabilityRating  
-----------------------------------------------------------------  --------------------------------------------------  -----  -------------------  -------------  ---------  ---------------  
victorsoeiro/netflix-tv-shows-and-movies                           Netflix TV Shows and Movies                           2MB  2022-05-15 00:01:23          15556        450  1.0              
ruchi798/data-science-job-salaries                                 Data Science Job Salaries                             7KB  2022-06-15 08:59:12           4372        153  1.0              
zusmani/petrolgas-prices-worldwide                                 Petrol/Gas Prices Worldwide                          10KB  2022-06-24 01:25:33           1921         90  1.0              
imoore/age-dataset                           
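
The shell steps above (create `~/.kaggle`, copy `kaggle.json`, restrict its permissions) can also be sketched in plain Python. This is only an illustration using the standard library; the function name `install_kaggle_credentials` and its `home` parameter are hypothetical and not part of the Kaggle tooling.

```python
import os
import shutil
from pathlib import Path


def install_kaggle_credentials(src: str, home: str) -> Path:
    """Copy kaggle.json into <home>/.kaggle and restrict its permissions."""
    target_dir = Path(home) / ".kaggle"
    # Equivalent of `mkdir -p`: no error if the directory already exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)
    # Equivalent of `chmod 600`: owner read/write only.
    os.chmod(target, 0o600)
    return target
```

In Colab this would be called as `install_kaggle_credentials("kaggle.json", str(Path.home()))` after the upload step. Because the file holds an API key, it is worth clearing the upload cell's output before sharing the notebook.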

## Installing the datasets library for Hugging Face

In [None]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.3.2-py3-none-any.whl (362 kB)
Collecting fsspec[http]>=2021.05.0
  Downloading fsspec-2022.5.0-py3-none-any.whl (140 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)
Collecting xxhash
  Downloading xxhash-3.0.0-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
Collecting responses<0.19
  Downloading responses-0.



# Other general imports

In [None]:
from datasets import load_dataset
import pandas as pd
import matplotlib.pyplot as plt
data = {} # dictionary to store all datasets as pandas dataframes in values

# 1. Daraz Roman Urdu Reviews (Kaggle)

Source: https://www.kaggle.com/datasets/naveedhn/daraz-roman-urdu-reviews

Paper Link: https://doi.org/10.7717/peerj-cs.472

Total Instances: 3923

Summary: Labelled as Positive, Negative, and Neutral in the "Sentiment" column.

Validity: Used in 2 different papers that can be found on the Kaggle link.

In [None]:
!kaggle datasets download -d naveedhn/daraz-roman-urdu-reviews
!unzip "/content/daraz-roman-urdu-reviews.zip"

Downloading daraz-roman-urdu-reviews.zip to /content
  0% 0.00/518k [00:00<?, ?B/s]
100% 518k/518k [00:00<00:00, 96.0MB/s]
Archive:  /content/daraz-roman-urdu-reviews.zip
  inflating: Daraz Labelled Review Dataset with Sentiments and Features.xlsx  


In [None]:
d1 = pd.read_excel("/content/Daraz Labelled Review Dataset with Sentiments and Features.xlsx")
d1 = d1.filter(["Reviews", "Sentiment"]) # keep only the 2 columns we need
d1 = d1.rename(columns={"Reviews": "Text"}) # rename so that all datasets share the same column names
d1["Sentiment"] = d1["Sentiment"].replace(['Positive', 'Negative', 'Neutral'], [0, 1, 2]) # unified labels: 0 = positive, 1 = negative, 2 = neutral

data["d1"] = d1 # save the dataframe in the data dictionary
# d1.head()

                                                   Text  Sentiment
15    ya chez wkt p aa gai thi aur kaam b sai kar rh...          2
18    may ny order kya tha mi a in golden aur unho n...          2
21              achi timing hai sirf  dino may agya hai          2
35    sb sy acha phone hai agar aap sy ko huwai k sa...          2
37    sb sy acha product hai sirf  din may mil gya h...          2
...                                                 ...        ...
3894            eendhan mo asar aur awaz kaam generator          2
3898  yeh generator achchca hai mein nay usay aik ha...          2
3913           acha brand aur khobsorat bhaari masnoaat          2
3916  auto safai aur shamsi mutabqat ke af aal mojoo...          2
3919  waqt par masool hwa aaccha lagta hai mein iss ...          2

[568 rows x 2 columns]


# 2. Multisenti (GitHub LUMS)
Source: https://github.com/haroonshakeel/multisenti

Paper Link: https://sbasse.lums.edu.pk/sites/default/files/2020-10/4%20%26%205%20-%20Deep_Learning_Methods_for_Short_Informal_and_Multilingual_Text_Analytics_V4.pdf

Total Instances: 20735

Summary: Labelled data with 3 sentiments (0: negative, 1: neutral, 2: positive)

Validity: From the LUMS University Archive for Machine Learning, in their Multilingual Textual Analysis archive.

In [None]:
xtrain = pd.read_csv("/content/X_train.tsv", sep='\t')
xtest= pd.read_csv("/content/X_test.tsv", sep='\t')
ytrain = pd.read_csv("/content/y_train.tsv", sep='\t')
ytest = pd.read_csv("/content/y_test.tsv", sep='\t')

df_train = pd.merge(xtrain, ytrain, left_index=True, right_index=True)
df_test = pd.merge(xtest, ytest, left_index=True, right_index=True)

frames = [df_train, df_test]
d2 = pd.concat(frames)

d2 = d2[d2["lang"] != "English"].copy() # drop English tweets; copy so the edits below do not act on a view

d2 = d2.rename(columns={"Tweet": "Text", "label": "Sentiment"}) # rename so that all datasets share the same column names
d2["Sentiment"] = d2["Sentiment"].replace([0, 1, 2], [1, 2, 0]) # source (0 neg, 1 neu, 2 pos) -> unified (0 pos, 1 neg, 2 neu)
d2 = d2.drop(columns=["lang"])

data["d2"] = d2
# d2.head

FileNotFoundError: ignored
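
Since the cell above could not find its input files in this run, the following is a small sketch of the same steps on toy data: pairing features with labels by row position, dropping English rows, and remapping the labels. The toy rows and the `"Roman Urdu"` value in the `lang` column are assumptions for illustration; only the column names and the remapping come from the cell above.

```python
import pandas as pd

# Toy stand-ins for X_*.tsv and y_*.tsv (hypothetical rows).
x = pd.DataFrame({
    "Tweet": ["acha hai", "good one", "bura hai"],
    "lang": ["Roman Urdu", "English", "Roman Urdu"],
})
y = pd.DataFrame({"label": [2, 1, 0]})

df = x.join(y)                           # align features and labels by index
df = df[df["lang"] != "English"].copy()  # keep only non-English rows
df = df.rename(columns={"Tweet": "Text", "label": "Sentiment"})
# Source scheme (0 neg, 1 neu, 2 pos) -> unified scheme (0 pos, 1 neg, 2 neu).
df["Sentiment"] = df["Sentiment"].map({0: 1, 1: 2, 2: 0})
df = df.drop(columns=["lang"])
```

A dictionary passed to `map` makes the old-to-new pairing explicit, which is easier to audit than two parallel lists.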

# 3. Dataset Card for roman_urdu_hate_speech (HuggingFace)
Source: https://huggingface.co/datasets/roman_urdu_hate_speech

Total instances: 10013

Summary: The Roman Urdu Hate-Speech and Offensive Language Detection (RUHSOLD) dataset is a Roman Urdu dataset of tweets annotated by experts in the relevant language. The authors develop the gold standard for two sub-tasks.
The tweets are labelled numerically (0 for abusive and 1 for neutral).

Paper Link: https://aclanthology.org/2020.emnlp-main.197/

Validity: Highly valid, as it is used in a paper on hate speech and offensive language recognition in Roman Urdu.

In [None]:
dataset = load_dataset("roman_urdu_hate_speech")

# build one single-row dataframe per training example, keeping only the 2 columns we need
rows = [pd.DataFrame({"Text": [i['tweet']], "Sentiment": [i['label']]}) for i in dataset["train"]]
d3 = pd.concat(rows) # DataFrame.append was removed in pandas 2.0; pd.concat builds the same rows

d3["Sentiment"] = d3["Sentiment"].replace([0, 1], [1, 2]) # unified labels: abusive -> 1 (negative), neutral -> 2

data["d3"] = d3
# d3.head()

No config specified, defaulting to: roman_urdu_hate_speech/Coarse_Grained
Reusing dataset roman_urdu_hate_speech (/root/.cache/huggingface/datasets/roman_urdu_hate_speech/Coarse_Grained/1.1.0/07c4e54fa3ec497e042b98bb64a75d841a4b7f7ae3ecb068368f5abc4b7f2e1b)


  0%|          | 0/3 [00:00<?, ?it/s]

In [None]:
d7[d7['Sentiment']==2].shape

(3850, 2)

# 4. Dataset 11000 Reviews

Source: https://archive.ics.uci.edu/ml/machine-learning-databases/00621/ 

Total Instances: 11000

Summary: Labelled data with 2 sentiments (pos, neg)

Validity: From the University of California, Irvine (UCI) Machine Learning archive.

Note: You need to download the dataset and convert it from xls to xlsx.
The dataset also needs to be uploaded to Colab session storage.

In [None]:
# d4 = pd.read_csv("/content/Dataset 11000 Reviews.tsv", sep='\t')
d4 = pd.read_excel("/content/Dataset-11000-Reviews.xlsx")
d4.columns = ["Sentiment", "Text"]
d4["Sentiment"] = d4["Sentiment"].replace(['pos', 'neg'], [0, 1]) # unified labels: 0 = positive, 1 = negative

data["d4"] = d4
# d4.head

# 5. Roman Urdu E-Commerce Dataset
Source: https://github.com/bilalbaloch1/RU-BiSLTM 

Paper Link: https://www.mdpi.com/2076-3417/12/7/3641/htm 

Total Instances: 26,824

Summary: Urdu reviews are labelled as Positive (1), Negative (0), or Neutral (2).
Validity: Highly valid, as it consists of reviews given by users on both Daraz and Twitter. Since it targets textual sentiment analysis, it suits our purposes.

Comments on Consistency: In E-commerce, decision-makers are always interested in obtaining feedback from a user so as to trigger the right decision at the right time. The sentiment analysis in this paper is done to help assess other systems such as recommendation systems, hate speech detection, and spam detection, among others. The labeling of positive, negative, and neutral is in a general context and so consistent with our needs.


In [None]:
d5 = pd.read_csv("/content/RUECD.csv")
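
The cell above only loads the file. A possible way to bring it into the unified format is sketched below on toy data. The column names `review` and `label` are hypothetical, since the real header of `RUECD.csv` is not shown here; the label scheme (1 positive, 0 negative, 2 neutral) is taken from the summary above.

```python
import pandas as pd

# Hypothetical column names; check the real RUECD.csv header before reusing this.
sample = pd.DataFrame({
    "review": ["bohat acha", "bekaar cheez", "theek hai"],
    "label": [1, 0, 2],
})

# Source: 1 positive, 0 negative, 2 neutral -> unified: 0 positive, 1 negative, 2 neutral.
ruecd_to_unified = {1: 0, 0: 1, 2: 2}

d5_sketch = sample.rename(columns={"review": "Text", "label": "Sentiment"})
d5_sketch["Sentiment"] = d5_sketch["Sentiment"].map(ruecd_to_unified)
```

Note that positive and negative swap places relative to the source scheme, so skipping this step would silently invert those two classes in the merged data.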

# 3. Roman Urdu Tagged Dataset (Kaggle)

Source: https://www.kaggle.com/datasets/nikhar25/roman-urdu-tagged-dataset

Total Instances: 21902

Summary: Contains the sentiment of Roman Urdu sentences. Sentences are labelled as positive, negative, or neutral.

Validity: Frequently used dataset, with more than 30 downloads of the previous version and 30 downloads of the latest version. The latest version was cleaned because the labelling was incorrect in some instances.

Paper Link: N/A

In [None]:
!kaggle datasets download -d nikhar25/roman-urdu-tagged-dataset
!unzip "/content/roman-urdu-tagged-dataset.zip"

Downloading roman-urdu-tagged-dataset.zip to /content
  0% 0.00/357k [00:00<?, ?B/s]
100% 357k/357k [00:00<00:00, 75.9MB/s]
Archive:  /content/roman-urdu-tagged-dataset.zip
  inflating: Roman Urdu Tagged Dataset.csv  


In [None]:
d6 = pd.read_csv("/content/Roman Urdu Tagged Dataset.csv")
d6.rename(columns={'Sentiment (POS/NEG/NEU)': "Sentiment"}, inplace=True) # renaming column to ensure same format of all datasets
d6["Sentiment"] = d6["Sentiment"].replace(['Positive', 'Negative', 'Neutral'], [0, 1, 2]) # unified labels: 0 = positive, 1 = negative, 2 = neutral

data["d6"] = d6 # saving dataframe in data dictionary
# d6.head
print(d6[d6["Sentiment"]==2])

                                            Text  Sentiment
5                 bht km melta hai markt me eb q          2
8      har product ki qeemat likhi honi chahiye           2
15          1 star denay ki waja bhi bata dain..          2
22           plz brost ki reciepe share karen...          2
25                                 Kia scene hai          2
...                                          ...        ...
21934                      roz dekhoga yeh drama          2
21935                              han han dekho          2
21936              muje bhi btaoo koi acha drama          2
21937                yehi wala dekh lo mere sath          2
21939                       kiya haal ha tumara?          2

[11560 rows x 2 columns]


# 1. Performing Natural Language Processing on Roman Urdu Datasets  (HuggingFace)

Source: https://huggingface.co/datasets/roman_urdu

Paper Link: http://paper.ijcsns.org/07_book/201801/20180117.pdf

Total Instances: 20000

Summary: Each row consists of a short Urdu text, followed by a sentiment label. The labels are one of Positive, Negative, and Neutral. Note that the original source file is a comma-separated values file.

Validity: Highly valid, as it is used in a paper on NLP for Roman Urdu.

Labels:
0 : Positive
1 : Negative
2 : Neutral
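
The label scheme above is the one every dataset in this notebook is converted to. As a rough sketch, a single helper could normalise the different spellings seen across the sources ("Positive", "pos", and so on). The helper name `to_label_id` and the exact set of accepted spellings are assumptions for illustration.

```python
# Accepted spellings (lower-cased) and their unified ids; extend as needed.
LABEL_IDS = {
    "positive": 0, "pos": 0,
    "negative": 1, "neg": 1,
    "neutral": 2, "neu": 2,
}


def to_label_id(raw):
    """Return 0, 1 or 2 for a sentiment label, or None if it is unrecognised."""
    if isinstance(raw, int) and raw in (0, 1, 2):
        return raw  # already in the unified scheme
    return LABEL_IDS.get(str(raw).strip().lower())
```

For example, `to_label_id(" Positive ")` gives `0`, while a misspelled label such as `"Neative"` gives `None` and can then be inspected or dropped explicitly.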

In [None]:
dataset = load_dataset("roman_urdu")
# build one single-row dataframe per training example, keeping only the 2 columns we need
rows = [pd.DataFrame({"Text": [i['sentence']], "Sentiment": [i['sentiment']]}) for i in dataset["train"]]
d0 = pd.concat(rows) # DataFrame.append was removed in pandas 2.0; pd.concat builds the same rows

data["d0"] = d0

d0.head

Downloading builder script:   0%|          | 0.00/1.59k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/941 [00:00<?, ?B/s]

Using custom data configuration default


Downloading and preparing dataset roman_urdu/default (download: 1.55 MiB, generated: 1.56 MiB, post-processed: Unknown size, total: 3.11 MiB) to /root/.cache/huggingface/datasets/roman_urdu/default/1.1.0/43ac4dc4994f29c6390770f065e1e9126933d523fc76e57fe3d648302b8e7232...


Downloading data:   0%|          | 0.00/1.63M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/20229 [00:00<?, ? examples/s]

Dataset roman_urdu downloaded and prepared to /root/.cache/huggingface/datasets/roman_urdu/default/1.1.0/43ac4dc4994f29c6390770f065e1e9126933d523fc76e57fe3d648302b8e7232. Subsequent calls will reuse this data.


  0%|          | 0/1 [00:00<?, ?it/s]

<bound method NDFrame.head of                                                  Text Sentiment
0   Sai kha ya her kisi kay bus ki bat nhi hai lak...         0
0                                           sahi bt h         0
0                                         Kya bt hai,         0
0                                          Wah je wah         0
0                                Are wha kaya bat hai         0
..                                                ...       ...
0            Hamari jese awam teli laga k mazay leti          1
0   Kaash hum b parhay likhay hotayKabhi likhtay g...         1
0   Bahi sayasat kufrrr ha saaaf bttttt ha qanon s...         1
0                      aanti toh gussa e kr gai hain          1
0   mai b sirf shadi kanry ki waja say imran khan ...         0

[20229 rows x 2 columns]>

In [None]:
dpos = d0[d0["Sentiment"] == 0 ]
dneg = d0[d0["Sentiment"] == 1 ]
dneu = d0[d0["Sentiment"] == 2 ]
print(dneu.head(5))
print(dneu.head(-5))

                                 Text Sentiment
0                          hakeqat hy         2
0          Aor aisy bahut km hain ryt         2
0                        jee ye to he         2
0        Hmm jysa kro gy wysa bhru gy         2
0  Ye kia hoa raha hain Pakistan main         2
                                                 Text Sentiment
0                                          hakeqat hy         2
0                          Aor aisy bahut km hain ryt         2
0                                        jee ye to he         2
0                        Hmm jysa kro gy wysa bhru gy         2
0                  Ye kia hoa raha hain Pakistan main         2
..                                                ...       ...
0                                           jeni chk.         2
0                     End pe comment check kr anti ka         2
0   bt ap public py kr rhy hBakol  k " bhai jamori...         2
0                                 han han jamhoriat h         2
0       

In [None]:
d0[d0["Sentiment"] == 0 ]

Unnamed: 0,Text,Sentiment
0,Sai kha ya her kisi kay bus ki bat nhi hai lak...,0
0,sahi bt h,0
0,"Kya bt hai,",0
0,Wah je wah,0
0,Are wha kaya bat hai,0
...,...,...
0,Phir theek hai,0
0,HahahahahaH hansi ni ruk rai comment pardh k,0
0,Wah Wah cha gye aunty lolxxxx,0
0,Ha lash h. Aunty bhe sahe h,0


# 4. Roman Urdu Data Set with Label (Kaggle)

Source: https://www.kaggle.com/datasets/huzzefakhan/roman-urdu-data-set-with-label

Total Instances: 19665

Summary: Roman Urdu sentences with their corresponding sentiment.

Validity: Highly used dataset with 66 downloads.

Paper Link: N/A

In [None]:
!kaggle datasets download -d huzzefakhan/roman-urdu-data-set-with-label
!unzip "/content/roman-urdu-data-set-with-label.zip"

Downloading roman-urdu-data-set-with-label.zip to /content
  0% 0.00/622k [00:00<?, ?B/s]
100% 622k/622k [00:00<00:00, 100MB/s]
Archive:  /content/roman-urdu-data-set-with-label.zip
replace Roman Urdu DataSet.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: Roman Urdu DataSet.csv  


In [None]:
d3 = pd.read_csv("/content/Roman Urdu DataSet.csv")
d3.rename(columns={'Sentences': "Text"}, inplace=True) # renaming column to ensure same format of all datasets
d3 = d3.filter(["Text", "Sentiment"]) # just want 2 columns
d3["Sentiment"] = d3["Sentiment"].replace(['Positive', 'Negative', 'Neutral'], [0, 1, 2]) # unified labels: 0 = positive, 1 = negative, 2 = neutral
d3["Sentiment"] = d3["Sentiment"].replace(['Neative'], [1]) # handle the anomaly: 'Neative' is a misspelling of 'Negative'

data["d3"] = d3 # saving dataframe in data dictionary
# d3.head

# 5. Daraz DataSet 4K Roman Urdu (Kaggle)

Source: https://www.kaggle.com/datasets/ahtshamrao/daraz-dataset-4k-roman-urdu

Total Instances: 3646

Summary: Contains public reviews of many users on Daraz labelled with 3 sentiments.

Validity: Not highly regarded: Kaggle reports low usability, a small dataset size, and a low download rate.

Paper Link: N/A

In [None]:
!kaggle datasets download -d ahtshamrao/daraz-dataset-4k-roman-urdu
!unzip /content/daraz-dataset-4k-roman-urdu.zip

Downloading daraz-dataset-4k-roman-urdu.zip to /content
  0% 0.00/237k [00:00<?, ?B/s]
100% 237k/237k [00:00<00:00, 68.9MB/s]
Archive:  /content/daraz-dataset-4k-roman-urdu.zip
  inflating: Standardized Dataset.csv  


In [None]:
d4 = pd.read_csv("/content/Standardized Dataset.csv")
d4.rename(columns={"Reviews": "Text", "Label" : "Sentiment"}, inplace=True) # renaming column to ensure same format of all datasets
d4 = d4.filter(["Text", "Sentiment"]) # just want 2 columns

data["d4"] = d4 # saving dataframe in data dictionary
# d4.head

# 6. Roman Urdu Dataset 
Source: https://archive.ics.uci.edu/ml/machine-learning-databases/00458/ 

Total Instances: 20229

Summary: Labelled data with 3 sentiments (Positive, Negative, Neutral)

Note: You need to download the dataset and convert it from tsv to csv.

Validity: From the University of California, Irvine (UCI) Machine Learning archive.

The dataset also needs to be uploaded to Colab session storage.
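
The tsv-to-csv conversion mentioned in the note can be done with Python's standard `csv` module. The helper below is a minimal sketch that works on text in memory; the function name `tsv_to_csv` is hypothetical, and reading and writing the actual files is left out.

```python
import csv
import io


def tsv_to_csv(tsv_text: str) -> str:
    """Convert tab-separated text to comma-separated text."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    # The writer quotes any field that itself contains a comma.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(reader)
    return out.getvalue()
```

Going through a CSV writer, rather than replacing tabs with commas, matters here because review text often contains commas of its own.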

In [None]:
d5 = pd.read_csv("/content/Roman Urdu DataSet.csv")
d5.columns = ["Text", "Sentiment", "nan"]
d5.drop("nan", inplace=True, axis=1)
d5["Sentiment"] = d5["Sentiment"].replace(['Positive', 'Negative', 'Neutral'], [0, 1, 2]) # unified labels: 0 = positive, 1 = negative, 2 = neutral

data["d5"] = d5
# d5.head

## Merging the Datasets

Creating a new pandas dataframe where all the datasets will be merged

In [None]:
# print(data.keys())
rows = 0
for i in data.keys(): # accessing the formatted dataframes that were stored in the dictionary
    print("Shape of", i, ":", data[i].shape)
    rows += data[i].shape[0]

# merge all formatted dataframes into one main dataframe
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
maindata = pd.concat(list(data.values()))

# print(rows, maindata.shape[0])
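
When merging, it can help to remember which source each row came from, for example to trace a labelling problem back to one dataset. The sketch below shows one way to do that with `pd.concat` on a dictionary of dataframes; the toy frames and the `source` column name are assumptions, not part of the notebook's output format.

```python
import pandas as pd

# Toy stand-ins for the formatted dataframes stored in `data`.
frames = {
    "d0": pd.DataFrame({"Text": ["sahi bt h"], "Sentiment": [0]}),
    "d1": pd.DataFrame({"Text": ["bekaar", "theek hai"], "Sentiment": [1, 2]}),
}

# Passing a dict makes its keys the outer index level, named "source" here.
merged = pd.concat(frames, names=["source", "row"])
# Move the source key into an ordinary column and discard the old row index.
merged = merged.reset_index(level="source").reset_index(drop=True)
```

The extra column can be dropped again before export if the final file should contain only `Text` and `Sentiment`.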

Removing irregularities like invalid, empty or NaN entries from dataset

In [None]:
print("Initial shape", maindata.shape)

maindata.drop_duplicates(keep="first", inplace=True)
maindata["Text"] = maindata.Text.str.strip()
maindata = maindata[maindata["Text"] != " "]
maindata = maindata[maindata["Text"] != ""]
maindata = maindata[maindata["Sentiment"] != " "]
maindata = maindata[maindata["Sentiment"] != ""]
maindata = maindata[maindata["Sentiment"] != "Neative"]
maindata["Sentiment"] = pd.to_numeric(maindata["Sentiment"]) # making sure that the sentiment tagging is in numbers (0 for pos, 1 for neg, 2 for neut)
maindata = maindata.dropna() # drop all entries with nan values

print("Final shape", maindata.shape)
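
The same cleaning idea can be sketched as a reusable function and checked on a few toy rows. This is an assumption-laden variant rather than the cell above: it strips whitespace before removing duplicates, and it turns unparseable labels into NaN instead of filtering specific strings. The name `clean` is hypothetical.

```python
import pandas as pd


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows with non-empty text and a valid 0/1/2 sentiment label."""
    out = df.dropna(subset=["Text"]).copy()
    out["Text"] = out["Text"].astype(str).str.strip()
    # Labels that are not numbers (e.g. "Neative") become NaN here.
    out["Sentiment"] = pd.to_numeric(out["Sentiment"], errors="coerce")
    out = out[(out["Text"] != "") & out["Sentiment"].isin([0, 1, 2])]
    out = out.drop_duplicates()
    return out.astype({"Sentiment": int}).reset_index(drop=True)


toy = pd.DataFrame({
    "Text": [" acha ", "acha", "", None, "bura", "ok"],
    "Sentiment": [0, "0", 1, 2, "Neative", 2.0],
})
cleaned = clean(toy)
```

Stripping first means `" acha "` and `"acha"` are treated as the same review, so only one of them survives deduplication.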

In [None]:
maindata.to_csv("Dataset.csv", index=False) # write the dataframe to a csv file without the index column

Visualizing the tagging distribution of the merged dataset

In [None]:
# frequency of each sentiment, ordered as 0 (pos), 1 (neg), 2 (neu) so the counts line up with the bar labels
# (value_counts() alone sorts by frequency, not by label)
y = list(maindata["Sentiment"].value_counts().reindex([0, 1, 2], fill_value=0))
plt.bar(["Positive", "Negative", "Neutral"], y) # plotting frequency against labels in a bar chart
plt.xlabel("Sentiment POS/NEG/NEU")
plt.ylabel("Frequency")
plt.title("Sentiment frequency of all Roman Urdu Datasets")
plt.show()
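
One detail worth checking when plotting label frequencies: `value_counts()` orders its result by frequency, not by label, so the counts should be put into label order before they are paired with fixed bar names. A small sketch on toy labels (the values are made up for illustration):

```python
import pandas as pd

sentiments = pd.Series([2, 2, 0, 1, 2, 0])  # toy labels
names = {0: "Positive", 1: "Negative", 2: "Neutral"}

# Reorder the counts by label id; labels that never occur get a count of 0.
counts = sentiments.value_counts().reindex([0, 1, 2], fill_value=0)
# Replace the numeric ids with readable names for the x-axis.
counts = counts.rename(index=names)
```

The result can then be passed to `plt.bar(counts.index, counts.values)`, which keeps each bar attached to its own label.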

In [None]:
print("Total instances:", maindata.shape)

In [None]:
files.download("/content/Dataset.csv") # download the final merged dataset