# Pre-processing 

This notebook pre-processes the dataset so that it can be loaded into Weka for sentiment analysis.

In [1]:
import pandas as pd
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

In [2]:
# Download required NLTK data - to be run only once!
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /Users/smritiu/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/smritiu/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /Users/smritiu/nltk_data...


True

In [3]:
# Loading the dataset 
df = pd.read_csv("egyptian_hieroglyphs_sentiment_analysis.csv")

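Two NLTK resources are set up for the cleaning step:
- stopwords, to remove commonly used English words
- a lemmatizer, which reduces each word to its base (dictionary) form. WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as "listened" are left unchanged.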

In [4]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
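
Stopword filtering is a membership test against that set. The sketch below illustrates the idea without NLTK; `toy_stop_words` and `remove_stopwords` are hypothetical names, and the set is a hand-picked subset for illustration, not the actual NLTK English list.

```python
# Hand-picked subset for illustration only; NOT the full NLTK English list.
toy_stop_words = {"before", "you", "is", "all", "the", "by"}

def remove_stopwords(tokens, stop_words):
    """Keep only the tokens that are not in stop_words."""
    return [t for t in tokens if t not in stop_words]

tokens = "before you is all good".split()
print(remove_stopwords(tokens, toy_stop_words))  # ['good']
```

Wrapping the list in `set(...)`, as the cell above does, keeps each lookup at roughly constant time instead of scanning a list for every token.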

The preprocess_text function takes in a sentence (the text parameter) and applies the following steps:
- converts the text to lowercase
- removes punctuation
- tokenizes the text (splits it into words)
- filters out stopwords
- lemmatizes each remaining word
- rejoins the cleaned words into a single string.

In [5]:
def preprocess_text(text):
    # lowercase
    text = str(text).lower()  
    
    # remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))  
    
    # tokenize
    words = word_tokenize(text) 
    
    # remove stopwords
    words = [w for w in words if w not in stop_words]  

    # lemmatize
    words = [lemmatizer.lemmatize(w) for w in words] 
    return ' '.join(words)
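
The punctuation step deletes every character in `string.punctuation`, which includes the apostrophe and the hyphen. The small sketch below isolates that one call to show its effect; `strip_punctuation` is a hypothetical helper name, not part of the notebook.

```python
import string

# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Return text with ASCII punctuation removed (same call as in preprocess_text)."""
    return text.translate(PUNCT_TABLE)

print(strip_punctuation("Life, prosperity, and health"))  # Life prosperity and health
print(strip_punctuation("don't"))                         # dont
print(strip_punctuation("well-known"))                    # wellknown
```

Because this runs before stopword filtering, a contraction such as "don't" becomes "dont" and may no longer match an entry in the stopword list. Likewise, if the transliterated text uses characters such as `.`, `-` or `=` as separators, they would be deleted and the parts on either side joined together.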

The dataset now has two new columns, Cleaned Transliteration and Cleaned English, which hold the cleaned text.

In [6]:
df["Cleaned Transliteration"] = df["Transliterated Text"].apply(preprocess_text)
df["Cleaned English"] = df["English Translation"].apply(preprocess_text)
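
One thing to watch with `apply`: `preprocess_text` calls `str(text)`, so an empty cell (which pandas represents as a float NaN) would be turned into the literal word "nan" rather than left empty. Whether this dataset has empty cells is not stated here; assuming it might, a guard along these lines could be used. `is_missing`, `clean_or_empty` and `lowercase_only` are hypothetical names, and `lowercase_only` is only a stand-in for `preprocess_text` so the sketch runs without NLTK.

```python
import math

def is_missing(value):
    """True for None or a float NaN, the usual forms of an empty CSV cell in pandas."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def clean_or_empty(value, preprocess):
    """Apply preprocess to value, or return '' when the cell is missing."""
    return "" if is_missing(value) else preprocess(value)

# Stand-in for preprocess_text (assumption: lowercasing only, no NLTK needed).
def lowercase_only(text):
    return str(text).lower()

print(repr(lowercase_only(float("nan"))))                     # 'nan'
print(repr(clean_or_empty(float("nan"), lowercase_only)))     # ''
print(repr(clean_or_empty("True of voice", lowercase_only)))  # 'true of voice'
```

In the notebook this would be applied as `df["English Translation"].apply(lambda t: clean_or_empty(t, preprocess_text))`.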

In [None]:
# saving dataset
df.to_csv("cleaned_dataset.csv", index=False)

print("Pre-processing done - debug statement (just in case)")

Pre-processing done - debug statement (just in case)


In [9]:
df[["English Translation", "Cleaned English"]].head(10)

| | English Translation | Cleaned English |
|---|---|---|
| 0 | Beautiful words spoken by the great god | beautiful word spoken great god |
| 1 | True of voice | true voice |
| 2 | He did good things for his lord | good thing lord |
| 3 | He listened to him as he spoke | listened spoke |
| 4 | His body trembled with his brother | body trembled brother |
| 5 | Life, prosperity, and health | life prosperity health |
| 6 | Life, prosperity, and health to his lord | life prosperity health lord |
| 7 | Before you is all good | good |
| 8 | I am the excellent one among the common people | excellent one among common people |
| 9 | The scribe is the one who acts in the house of... | scribe one act house life |
