<h2>FILTERING DATA FOR THE GUESSING GAME</h2>
    
<p>Author: Saffanah Fathin
<p>Python Version 3.10
<p>Encoding: utf-8

Summary: The Python code below filters the corpus data for the sentence guessing game. It generates two txt files: "dataforgame.txt", which holds the main data for the game, and "data_words_text.txt", which serves as the basis for Hint 3 in the game.

Step 1: Importing the corpus data in CoNLL-U format with NLTK and storing it in a variable

In [1]:
#ignoring all the columns except the words and POS tags
from nltk.corpus.reader.conll import ConllCorpusReader
data = ConllCorpusReader('./', ['en_ewt-ud-train_preproc.conllu'], 
                           ['ignore', 'words', 'ignore', 'pos', 'ignore', 'ignore', 'ignore', 'ignore', 'ignore', 'ignore'])
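
For orientation, the ten entries in the column list line up with the ten tab-separated fields of a CoNLL-U token line (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The sketch below parses one hand-written token line without NLTK to show which two fields the reader keeps. The sample line and the helper name `form_and_upos` are made up for illustration and are not part of the original script.

```python
# Field names defined by the CoNLL-U format, in column order
CONLLU_FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
                 "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]

def form_and_upos(token_line):
    """Return (FORM, UPOS) from one tab-separated CoNLL-U token line."""
    fields = token_line.rstrip("\n").split("\t")
    row = dict(zip(CONLLU_FIELDS, fields))
    return row["FORM"], row["UPOS"]

# Hypothetical token line, not taken from the corpus
sample = "1\tdogs\tdog\tNOUN\tNNS\tNumber=Plur\t2\tnsubj\t_\t_"
pair = form_and_upos(sample)
print(pair)
```

FORM and UPOS sit in the second and fourth positions, which is where 'words' and 'pos' appear in the column list above.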

Step 2: Storing the variables needed for the filtering process (sentences, POS tags, rare words, and sentence lengths)

In [2]:
#Stopwords in English
from nltk.corpus import stopwords
en_stopwords = set(stopwords.words('english'))

In [3]:
#Storing necessary data into variables
words_list = data.words() #a list of words
sent_list = data.sents() #a list of sentences

#Part of Speech 
pos_words = data.tagged_words()
pos_sent = data.tagged_sents()

In [7]:
#calling the library
from nltk import FreqDist

#storing the frequency of every word in the corpus data
freqs = FreqDist(data.words())

In [8]:
#Words that occur 10 times or fewer
rare_words = [word for word, freq in freqs.items() if freq <= 10]
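
The cutoff can be checked on a toy word list. This sketch uses `collections.Counter`, which `FreqDist` subclasses, so the `.items()` loop behaves the same way. The toy list and the cutoff of 1 are assumptions chosen to keep the example small.

```python
from collections import Counter

# Toy data, not taken from the corpus
toy_words = ["the", "cat", "saw", "the", "dog", "the", "cat"]
toy_freqs = Counter(toy_words)

# Same pattern as rare_words, with a smaller cutoff for the toy data
toy_threshold = 1
toy_rare = [w for w, freq in toy_freqs.items() if freq <= toy_threshold]
print(toy_rare)
```

Because the comparison is `<=`, a word that occurs exactly as often as the cutoff still counts as rare.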

In [9]:
#Frequency of POS Tags:
pos_freqs = FreqDist([pos for (word, pos) in pos_words])

#Sentence lengths
sent_lens = [len(s) for s in sent_list]
len_freqs = FreqDist(sent_lens)

 <h3> Filtering the data: </h3>
    1. Eliminating sentences that are too long and too short

In [10]:
#keeping only sentences with 6 to 19 tokens (too short and too long ones are dropped)
datacut = [i for i in sent_list if 5<len(i)<20]
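
Both comparisons in the chained condition are strict, so sentences of exactly 5 or exactly 20 tokens are dropped. A small check on dummy sentences (the placeholder token "w" is an assumption for illustration):

```python
# Dummy sentences of 5, 6, 19 and 20 tokens
toy_sents = [["w"] * n for n in (5, 6, 19, 20)]

# Same condition as in datacut
kept_lengths = [len(s) for s in toy_sents if 5 < len(s) < 20]
print(kept_lengths)
```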

2. Eliminating sentences that contain rare words

In [11]:
#removing sentences that contain rare words
data_no_rare = [sublist for sublist in datacut if not any(x in sublist for x in rare_words)]
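
The filter above checks every rare word against every sentence, which can be slow when the rare-word list is long. One possible alternative, sketched here on toy data, turns the rare words into a set and keeps a sentence only if it shares no element with that set. The function name `drop_sentences_with` and the toy inputs are hypothetical.

```python
def drop_sentences_with(sentences, banned_words):
    """Keep only the sentences that contain none of the banned words."""
    banned = set(banned_words)
    # set.isdisjoint accepts any iterable, so each sentence can stay a list
    return [sent for sent in sentences if banned.isdisjoint(sent)]

# Toy data, not taken from the corpus
toy_sentences = [["the", "cat", "sat"], ["a", "quokka", "sat"], ["the", "dog"]]
toy_rare_words = ["quokka", "axolotl"]
toy_kept = drop_sentences_with(toy_sentences, toy_rare_words)
print(toy_kept)
```

The result should match the list-based version, since both keep a sentence exactly when none of its tokens is a rare word.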

   3. Removing punctuation from the data
   
Note: I decided to remove all punctuation from the data (except the apostrophe in contractions) to keep the data in the game simple

In [12]:
#remove the punctuations
from string import punctuation
#convert the punctuation characters into a list
punc = list(punctuation)

#eliminating tokens that are a single punctuation character from the dataset
data_no_punc = [[w for w in sent if w not in punc] for sent in data_no_rare]
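
This step compares whole tokens against single punctuation characters, so it removes a token such as "," but leaves multi-character tokens such as "n't" or "..." untouched. A toy sentence (made up for illustration) shows the effect:

```python
from string import punctuation

punc_chars = set(punctuation)

# Hypothetical tokenized sentence, not taken from the corpus
toy_sent = ["I", "do", "n't", "know", ",", "really", "...", "!"]
kept = [w for w in toy_sent if w not in punc_chars]
print(kept)
```

The apostrophe in "n't" survives because the token as a whole is not a punctuation character, and "..." survives for the same reason.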

In [14]:
#convert every word to lowercase
dataclean = [[w.lower() for w in sublist] for sublist in data_no_punc]

Note: When I checked the data manually, I realized that some punctuation marks occur more than once, so I decided to remove those as well

In [15]:
#stripping the remaining punctuation characters from every word
punkt = '!-?.,:)'
new_dat = [[''.join(letter for letter in word if letter not in punkt) for word in sent if word] for sent in dataclean]

#removing the empty strings left behind by the stripping
new_dataclean = [[word for word in sent if word] for sent in new_dat]
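
Another way to strip a fixed set of characters is a translation table built with `str.maketrans`, whose third argument lists the characters to delete. The sketch below is an alternative under that assumption, not the method used in this notebook. The helper name `strip_chars` and the toy sentences are hypothetical.

```python
punkt_chars = '!-?.,:)'

# Third argument of str.maketrans: characters to delete
strip_table = str.maketrans('', '', punkt_chars)

def strip_chars(sentences):
    """Delete the listed characters from every word, then drop empty words."""
    stripped = [[w.translate(strip_table) for w in sent] for sent in sentences]
    return [[w for w in sent if w] for sent in stripped]

# Toy data, not taken from the corpus
toy = [["wait", "...", "what?!"], ["e-mail", ":)"]]
cleaned = strip_chars(toy)
print(cleaned)
```

Note that the hyphen is in the character set, so "e-mail" becomes "email" rather than being split into two words.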

    5. Further filtering for unsuitable sentences

In [16]:
#special case: delete sentences that contain the word 'carol'
carol = ["carol"]
new_dataclean_nocarol = [sublist for sublist in new_dataclean if not any(x in sublist for x in carol)]

<h3> Storing all necessary data in txt files for accessibility </h3>

In [17]:
#storing the final sentences for the game in a txt file
with open("dataforgame.txt", 'w', encoding='UTF-8') as f:
    for item in new_dataclean_nocarol:
        s = " ".join(map(str, item))
        f.write(s+'\n')

In [18]:
#storing the word list of the corpus in a txt file for Hint 3 in main.py
with open("data_words_text.txt", 'w', encoding='UTF-8') as f:
    for item in words_list:
        s = item + " "
        f.write(s)
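
A consumer of the sentence file could load it back roughly as follows. How `main.py` actually reads the files is not shown here, so the loader name `load_sentences`, the demo file name, and the toy sentences are all assumptions. The sketch writes to a temporary directory so it does not touch the real output files.

```python
import os
import tempfile

def load_sentences(path):
    """Read one sentence per line and split each line back into tokens."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f if line.strip()]

# Toy data, not taken from the corpus
toy_sentences = [["this", "is", "a", "test"], ["do", "n't", "stop"]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dataforgame_demo.txt")
    # Same layout as dataforgame.txt: space-joined tokens, one sentence per line
    with open(path, "w", encoding="utf-8") as f:
        for sent in toy_sentences:
            f.write(" ".join(sent) + "\n")
    loaded = load_sentences(path)

print(loaded)
```

Splitting on whitespace recovers the original token lists as long as no token contains a space.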