<center> <img src = "upy2.png"> </center>

### <center> Universidad Politécnica de Yucatán </center>
### <center> Data Engineering </center>
### <center> Natural Language Processing </center>
### <center> Isabel Cámara Salinas </center>
### <center> Ricardo Armando Centeno Santos </center>
### <center> Mayte Alejandra Chi Poot </center>
### <center> Victor Rodrigo Uribe Hernández </center>
### <center> 9th quarter </center>
### <center> Mario Campos Soberanis </center>
### <center> November 11, 2022 </center>

In [2]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer 
from nltk.tokenize import word_tokenize
import sklearn
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
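# Run once to download the NLTK data used below:
# nltk.download('wordnet')  # data used by WordNetLemmatizer
# nltk.download('punkt')    # tokenizer data used by word_tokenize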

In [3]:
class PreProcessor:
    
    
    def __init__(self):
        '''
        The PreProcessor class provides different methods to preprocess a string.
        Attributes:
            pos: stores the part-of-speech tag used for the lemmatization. The default is 'n' (noun), because
            lemmatizing words as nouns is the most common and useful case. This tag can be modified by the
            user.
            special_chars: stores the characters that we want to remove from the strings in the removeNoise method.
            regex_dict: stores the dictionary of regular expressions that we want to extract or remove.
        
        Methods:
            stemWords: It receives a string and reduces each word to its stem.
            lemmatizeWords: It receives a string and reduces each word to its lemma, a word that exists in the language.
            removeNoise: It receives a string, lowercases it and removes all the characters included in the
            special_chars attribute to obtain a cleaned string.
            wordTokenize: It receives a string and splits it into words.
            phraseTokenize: It receives a string and splits it into phrases.
            textNormalization: It receives a string and removes every match of the regular expressions defined in
            the regex_dict attribute.
            extractRegex: It receives a string and extracts the matches of the regular expressions defined in the
            regex_dict attribute.
            cleaning: This method calls the methods that we consider necessary for the preprocessing of the tweet
            database.
        '''
        self.pos = 'n'
        self.special_chars = ",.@?!¬-\''=()"
        self.regex_dict = {'Tags' : r'@[A-Za-z0-9]+', 
                      '# symbol' : r'#', 
                      'RT' : r'RT', 
                      'Links' : r'https?://\S+',
                      'Not letters': r'[^A-Za-z\s]+',
                      'Phone' : r'\+[0-9]{12}'}
    
    def stemWords(self, string):
        '''
        Input: String
        Process: Receives a string and uses the PorterStemmer class of the nltk library to do the stemming.
        We call the wordTokenize method to split the string, obtain the stem of each word, and then join
        the stemmed words into a string again.
        Return: Stemmed string
        '''
        ps = PorterStemmer()
        stem = list(map(ps.stem, self.wordTokenize(string)))
        stemmed = ' '.join(stem)
        return stemmed
    
    def lemmatizeWords(self, string):
        '''
        Input: String
        Process: Receives a string and uses the WordNetLemmatizer class of the nltk library to do the
        lemmatization, passing each word together with the part-of-speech attribute. We call the wordTokenize
        method to split the string, obtain the lemma of each word, and then join the lemmatized words into a
        string again.
        Return: Lemmatized string
        '''
        wnl = WordNetLemmatizer()
        lemm = [wnl.lemmatize(word, self.pos) for word in self.wordTokenize(string)]
        lematized = ' '.join(lemm)
        return lematized
    
    def removeNoise(self, string):
        '''
        Input: String
        Process: Receives the string and converts it to lowercase. Then it replaces every character that is part
        of the special_chars attribute with an empty string, tokenizes the result and joins the tokens again
        with single spaces.
        Return: Cleaned string
        '''
        clean_string = string.lower()
        for char in self.special_chars:
            clean_string = clean_string.replace(char, "")
        splitted = self.wordTokenize(clean_string)
        cleaned = [w.replace(" ", "") for w in splitted if len(w) > 0]
        clean_string = " ".join(cleaned)
        return clean_string
    
    def wordTokenize(self, string):
        '''
        Input: String
        Process: Receives the string and splits it into a list of words using the word_tokenize function of nltk.
        Return: List of tokenized words
        '''
        tokenized = word_tokenize(string)
        return tokenized
    
    def phraseTokenize(self, string):
        '''
        Input: String
        Process: Receives the string and splits it into phrases using '. ' as the separator.
        Return: List of tokenized phrases
        '''
        cleaned = string.split('. ')
        return cleaned
    
    def textNormalization(self, string):
        '''
        Input: String
        Process: Receives the string and removes every match of the regular expressions stored in the
        regex_dict attribute, iterating over the dictionary. The result is tokenized and joined again
        with single spaces.
        Return: String without the matches of the regular expressions
        '''
        for key in self.regex_dict.keys():
            string = re.sub(self.regex_dict[key], '', string)
        normalized =  " ".join(self.wordTokenize(string))
        return normalized
    
    def extractRegex(self, string):
        '''
        Input: String
        Process: Receives the string, creates a dictionary to store the matches of each regular expression
        found in the string, and calls the textNormalization method to generate a cleaned string that is also
        returned to the user.
        Return: Dictionary with the extracted matches and a cleaned string without them
        '''
        dict_found_strings = dict()
        
        for key in self.regex_dict.keys():
            found_strings = re.findall(self.regex_dict[key], string)
            dict_found_strings[key] = found_strings

        replaced_string = self.textNormalization(string = string)
        return dict_found_strings, replaced_string
    
    def cleaning(self, data):
        '''
        Input: Data
        Process: This method calls the methods that we consider necessary for the preprocessing of the tweets
        database. In this case we apply text normalization, stemming and noise removal, in that order. The
        method receives a column of the dataframe and returns the preprocessed column.
        Return: Preprocessed data
        '''
        #text Normalization
        data = data.apply(self.textNormalization)
        
        #stem words
        data = data.apply(self.stemWords)
        
        #Remove Noise
        data = data.apply(self.removeNoise)
        
        return data
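
As a rough illustration of what `textNormalization` does with `regex_dict`, the sketch below applies the same patterns in the same order using only the standard `re` module. The helper name `normalize` and the sample tweet are made up for this example, and plain whitespace splitting stands in for `word_tokenize`.

```python
import re

# Same patterns, in the same order, as regex_dict in PreProcessor
REGEX_DICT = {
    'Tags': r'@[A-Za-z0-9]+',
    '# symbol': r'#',
    'RT': r'RT',
    'Links': r'https?://\S+',
    'Not letters': r'[^A-Za-z\s]+',
    'Phone': r'\+[0-9]{12}',
}

def normalize(text):
    # hypothetical helper: remove every match of every pattern, then collapse whitespace
    for pattern in REGEX_DICT.values():
        text = re.sub(pattern, '', text)
    return ' '.join(text.split())

sample = "RT @united Loves #twitter https://t.co/abc 100%"
print(normalize(sample))  # Loves twitter
```

Because 'Not letters' runs before 'Phone', the digits are already gone when the phone pattern is applied, so that pattern only has an effect in `extractRegex`, which searches the original string.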
        

In [4]:
df = pd.read_csv("tweets.csv")
tweets = df['tweet']

In [5]:
prep = PreProcessor()

In [6]:
df['cleaning_tweets'] = prep.cleaning(tweets)
df

Unnamed: 0,tweet,senti,cleaning_tweets
0,"@united Oh, we are sure it's not planned, but ...",0,oh we are sure it not plan but it occur absolu...
1,History exam studying ugh,0,histori exam studi ugh
2,@unnitallman yeah looks like that only! &quot;...,0,yeah look like that onli quotbusyquot is fuck ...
3,Loves twitter,4,love twitter
4,@Mbjthegreat i really dont want AT&amp;T phone...,0,i realli dont want atampt phone servicethey su...
...,...,...,...
293,@criticalpath Such an awesome idea - the cont...,4,such an awesom idea the continu learn program ...
294,"Talk is Cheap: Bing that, I?ll stick with Goog...",0,talk is cheap bing that ill stick with googl
295,@CTesdahl well you can always digg or stumble....,4,well you can alway digg or stumbleboth have ai...
296,Stopped to have lunch at McDonalds. Chicken Nu...,4,stop to have lunch at mcdonald chicken nuggets...


In [7]:
df = df.drop(['tweet'], axis = 1)
df

Unnamed: 0,senti,cleaning_tweets
0,0,oh we are sure it not plan but it occur absolu...
1,0,histori exam studi ugh
2,0,yeah look like that onli quotbusyquot is fuck ...
3,4,love twitter
4,0,i realli dont want atampt phone servicethey su...
...,...,...
293,4,such an awesom idea the continu learn program ...
294,0,talk is cheap bing that ill stick with googl
295,4,well you can alway digg or stumbleboth have ai...
296,4,stop to have lunch at mcdonald chicken nuggets...


#### Dividing the data set into training and test sets

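We divided the dataset into a training set and a test set, with the goal of using 90% of the data for training and 10% for testing. As the cells below are written, the test set is the slice `df[218:298]` (80 tweets), while the training set only drops rows 268 to 297, so rows 218 to 267 appear in both sets. A strict 90/10 split would take `df[268:298]` as the test set.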
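
One way to guarantee that no tweet ends up in both sets is to cut the data at a single index. This is a minimal sketch on plain lists of `[label, text]` rows, the same shape as `D` later in the notebook; the helper `split_train_test` and the toy rows are hypothetical and not part of the original code.

```python
def split_train_test(rows, train_fraction=0.9):
    # hypothetical helper: the first rows go to training, the remaining rows to testing
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# made-up rows: [sentiment label, cleaned tweet]
rows = [[label, 'tweet %d' % i] for i, label in enumerate([0, 2, 4, 0, 2, 4, 0, 2, 4, 0])]
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))  # 9 1
```

The same cut can be applied to a DataFrame with positional slicing, so that the test set starts exactly where the training set ends.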

In [8]:
test = df[218:298]
test

Unnamed: 0,senti,cleaning_tweets
218,2,adob cs commerci by goodbi silverstein
219,4,i would like to go for a drive with these guy ...
220,2,googl vision deliv packag with driverless car ...
221,0,wtf is the point of delet tweet if they can st...
222,0,i just wrote and entir stori and it didnt save...
...,...,...
293,4,such an awesom idea the continu learn program ...
294,0,talk is cheap bing that ill stick with googl
295,4,well you can alway digg or stumbleboth have ai...
296,4,stop to have lunch at mcdonald chicken nuggets...


In [9]:
df = df.drop(range(268,298),axis=0)
df

Unnamed: 0,senti,cleaning_tweets
0,0,oh we are sure it not plan but it occur absolu...
1,0,histori exam studi ugh
2,0,yeah look like that onli quotbusyquot is fuck ...
3,4,love twitter
4,0,i realli dont want atampt phone servicethey su...
...,...,...
263,0,go to the dentist later
264,2,that look an aw lot like one of nike privat je...
265,2,i thought it wa a selfdriv carlt must be a man...
266,2,watch a programm about the life of hitler it o...


In [10]:
D = df.to_numpy().tolist()
c = df['senti'].to_numpy().tolist()
C = sorted(set(c))  # sorted so that the class order is always [0, 2, 4], as labeling expects

#### Naive Bayes training model 

This is the model that we use to train on the data: the Naive Bayes algorithm.
The function first initializes `logprior` and `loglikelihood`. The log prior of a class is the logarithm of the fraction of training tweets that belong to that class, log(N_c / N_doc). The log likelihood of a word given a class is the logarithm of its add-one smoothed count in that class, which this implementation divides by the vocabulary size plus one. `C` holds the classes of the dataset, in this case the sentiment labels (0, 2, 4).

For each class, the function counts the training tweets that belong to it and the words they contain. Along the way it builds the vocabulary `V`, which is returned together with the log priors and log likelihoods so that they can be used in the testing step.
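
For comparison, the textbook multinomial Naive Bayes estimate with add-one (Laplace) smoothing divides by the total number of words in the class plus the vocabulary size, and builds the full vocabulary before processing any class. The sketch below illustrates that variant on a made-up three-document corpus. It is not the function used in this notebook, and the name `train_nb_laplace` is hypothetical.

```python
import math
from collections import Counter

def train_nb_laplace(docs, classes):
    # docs: list of [label, text]; assumes every class has at least one document
    vocab = {w for _, text in docs for w in text.split()}
    prior, likelihood = {}, {}
    for c in classes:
        texts = [text for label, text in docs if label == c]
        # log of the fraction of documents in class c
        prior[c] = math.log(len(texts) / len(docs))
        counts = Counter(w for text in texts for w in text.split())
        total = sum(counts.values())
        # add-one smoothing: (count + 1) / (words in class + vocabulary size)
        likelihood[c] = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                         for w in vocab}
    return prior, likelihood, vocab

toy_docs = [[0, 'bad bad day'], [4, 'good day'], [4, 'love it']]
prior, likelihood, vocab = train_nb_laplace(toy_docs, [0, 4])
```

With this denominator, the smoothed word probabilities of each class sum to one, which is not the case when dividing by the vocabulary size plus one.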

In [11]:
def train_naive_bayes(D, C):
    '''
    Input: D, a list of [label, text] documents; C, the list of class labels
    Process: Counts how often each word appears in each class and converts the counts to add-one
    smoothed log values. The denominator used here is len(V) + 1, where V holds the words seen up to
    the current class.
    Return: logprior, loglikelihood and the vocabulary V
    '''
    logprior = {}
    loglikelihood = {}
    V = set()
    N_doc = len(D)  # total number of training documents

    for c in C:
        loglikelihood[c] = {}
        N_c = 0  # number of documents from D in class c

        for d in D:
            # d[0] is the label and d[1] is the cleaned tweet
            if d[0] == c:
                N_c += 1
                for word in d[1].split(' '):
                    V.add(word)
                    # count the occurrences of word in class c
                    loglikelihood[c][word] = loglikelihood[c].get(word, 0) + 1

        # log of the fraction of documents that belong to class c
        logprior[c] = np.log(N_c / N_doc)

        # turn the counts into smoothed log values for every word seen so far
        for word in V:
            count = loglikelihood[c].get(word, 0)
            loglikelihood[c][word] = np.log((count + 1) / (len(V) + 1))
    return logprior, loglikelihood, V

In [12]:
logprior, loglikelihood, V = train_naive_bayes(D, C)

In [13]:
logprior

{0: -0.9660141672265856, 2: -1.4165997106152195, 4: -0.9758664636695973}

#### Naive Bayes test model
Once the training is done, we apply the Naive Bayes testing function. For each class, it starts from the log prior and adds the log likelihood of every word of the tweet that appears in the vocabulary; words outside the vocabulary are ignored. The result is an array with one score per class. Our classes are 0, 2 and 4: class 0 is the negative sentiment, class 2 is the neutral sentiment and class 4 is the positive sentiment. Finally, we create the variable `test_doc` to store the test tweets and `test_values` to store their labels.
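
The scoring step can be sketched as follows: for each class, start from the log prior and add the log likelihood of every known word of the tweet. The parameter values below are invented for illustration, and `score_document` and the `toy_` names are hypothetical.

```python
import math

def score_document(text, logprior, loglikelihood, vocab):
    # returns {class: log prior + sum of the log likelihoods of the known words}
    scores = {}
    for c in logprior:
        scores[c] = logprior[c]
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                scores[c] += loglikelihood[c][word]
    return scores

# made-up parameters for two classes and a two-word vocabulary
toy_vocab = {'good', 'bad'}
toy_logprior = {0: math.log(0.5), 4: math.log(0.5)}
toy_loglikelihood = {0: {'good': math.log(0.25), 'bad': math.log(0.75)},
                     4: {'good': math.log(0.75), 'bad': math.log(0.25)}}
scores = score_document('good good movie', toy_logprior, toy_loglikelihood, toy_vocab)
```

Working in log space turns the product of probabilities into a sum, which avoids numerical underflow on longer tweets.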

In [14]:
def testing_naive_bayes(testdoc, logprior, loglikelihood, C, V):
    # output array with one log score per class, in the order of C
    suma = np.zeros(len(C))
    for i, c in enumerate(C):
        # start from the log prior of class c
        suma[i] = logprior[c]
        # add the log likelihood of every word of testdoc that is in V
        for word in testdoc.split(' '):
            if word in V:
                if word in loglikelihood[c]:
                    suma[i] += loglikelihood[c][word]
                else:
                    # sentinel for a word with no entry in class c; labeling replaces it with -500
                    suma[i] = 1
    return suma

In [15]:
test_doc = test['cleaning_tweets'].to_numpy().tolist()
test_values = test['senti'].to_numpy().tolist()

#### Naive Bayes prediction model

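The `labeling` function takes the array of scores returned by the testing function, finds the position of the highest value and returns the label at that position. Scores equal to 1, the value the testing function assigns when a vocabulary word has no entry for a class, are first replaced by -500 so that they cannot win. The three possible labels are 0, 2 and 4.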
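
A variant that does not depend on the position of each class in a list keeps the scores in a dictionary keyed by label and returns the key with the highest score. This is only a sketch with invented scores, and `pick_label` is a hypothetical name.

```python
def pick_label(scores):
    # scores: {label: log score}; return the label with the highest score
    return max(scores, key=scores.get)

print(pick_label({0: -12.4, 2: -15.1, 4: -9.8}))  # 4
```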

In [16]:
def labeling(array):
    array = [-500 if element == 1 else element for element in array]
    array = list(array)
    maximum = max(array)
    pos = array.index(maximum)
    return [0, 2, 4][pos]

In [17]:
labels_predicted = []
for element in test_doc:
    naive = testing_naive_bayes(element, logprior, loglikelihood, C, V)
    labels_predicted.append(labeling(naive))

#### Accuracy, precision, and recall
After training and testing the model, we calculate its accuracy, recall and precision. The accuracy is the number of correct predictions divided by the total number of predictions; recall and precision are macro-averaged over the three classes. The accuracy of our model is 0.66, the recall is 0.65 and the precision is 0.73. These results are modest, and we think the main reason is the small dataset; with a bigger dataset we would expect better results.

In [23]:
accuracy_score(test_values, labels_predicted)
print("The accuracy of the model is: ", accuracy_score(test_values, labels_predicted))

The accuracy of the model is:  0.6625


In [24]:
recall_score(test_values, labels_predicted, average = 'macro')
print("The recall of the model is: ", recall_score(test_values, labels_predicted, average = 'macro'))

The recall of the model is:  0.6541847041847042


In [25]:
precision_score(test_values, labels_predicted, average = 'macro')
print("The precision of the model is: ", precision_score(test_values, labels_predicted, average = 'macro'))

The precision of the model is:  0.726890756302521


#### Confusion matrix
Finally, we compute the confusion matrix, which shows the number of correct and incorrect predictions. Each row is a true class and each column a predicted class, in the order 0, 2, 4, so the diagonal holds the correct predictions and the rest of the matrix holds the errors. The model performs well on class 0 (28 of 30 tweets correct), worse on class 2 (14 of 22) and worst on class 4 (11 of 28, with 14 positive tweets predicted as negative). We think that this is because we are using a small dataset, and that a bigger dataset would improve the results.

In [27]:
#confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(test_values, labels_predicted)



array([[28,  0,  2],
       [ 7, 14,  1],
       [14,  3, 11]], dtype=int64)
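
The accuracy, macro recall and macro precision reported earlier can be recovered from this matrix by hand. The sketch below uses only plain Python; the variable names are chosen for this example.

```python
# confusion matrix from the notebook: rows are true classes, columns are predicted classes
cm = [[28, 0, 2],
      [7, 14, 1],
      [14, 3, 11]]

n = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n))
accuracy = correct / total

# recall per class: correct predictions divided by the row total
recalls = [cm[i][i] / sum(cm[i]) for i in range(n)]
# precision per class: correct predictions divided by the column total
precisions = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]

macro_recall = sum(recalls) / n
macro_precision = sum(precisions) / n
print(round(accuracy, 4), round(macro_recall, 4), round(macro_precision, 4))
# 0.6625 0.6542 0.7269
```

The per-class values make the imbalance visible: recall is about 0.93 for class 0 but only about 0.39 for class 4.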