<a href="https://colab.research.google.com/github/sergiobm3/ESI_MachineLearning/blob/NLP/PREPROCESSING.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Introduction

Before working with the data, we preprocess it so that later steps can use it more efficiently.

We recommend that you do not run this notebook, because the **Correct Wrong Words** section takes a long time to execute. For that reason, we provide the result of running the notebook as a CSV file named *preprocessed_data.csv*.

## Libraries

In [None]:
import io
import pandas as pd
from google.colab import files
import re

# Libraries for natural language processing
import nltk

# Libraries for tweets
!pip install tweet-preprocessor
import preprocessor as p
from nltk.tokenize import TweetTokenizer

# Libraries for correct wrong words
from collections import Counter

# Libraries for emoticons
# Assumption: UNICODE_EMOJI is importable as a flat dict only in emoji releases before 1.0
!pip install "emoji<1.0"
from emoji import UNICODE_EMOJI

# Libraries for stopwords
from nltk.corpus import stopwords

# Libraries for lemmatizer
from nltk.stem import WordNetLemmatizer

# For tweets
nltk.download('punkt')

# For stopwords
nltk.download('stopwords')

# For lemmatizer
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

# 1. Loading Data

We start by loading the CSV file that contains the tweets used in this study.

The *labeled_data.csv* file must be uploaded before running this cell.

In [None]:
df = pd.read_csv('./labeled_data.csv', sep=',')
df

Unnamed: 0.1,Unnamed: 0,count,hate_speech,offensive_language,neither,class,tweet
0,0,3,0,0,3,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,3,0,3,0,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,2,3,0,3,0,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,3,3,0,2,1,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,4,6,0,6,0,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...
...,...,...,...,...,...,...,...
24778,25291,3,0,2,1,1,you's a muthaf***in lie &#8220;@LifeAsKing: @2...
24779,25292,3,0,1,2,2,"you've gone and broke the wrong heart baby, an..."
24780,25294,3,0,3,0,1,young buck wanna eat!!.. dat nigguh like I ain...
24781,25295,6,0,6,0,1,youu got wild bitches tellin you lies


# 2. Preprocessing

We preprocess the tweet field of the dataset so that later analysis is more efficient and accurate.

## Remove unuseful data

We remove symbols that carry no information, such as punctuation and the exclamation marks at the start of a tweet.

In [None]:
def cleanUnusefulData(sentence):
  # Delete every character listed in the global `pattern`, which is defined
  # in the "Executing preprocessing" section before this function is called.
  sentence = sentence.translate(str.maketrans('', '', pattern))
  return sentence

def removeUnusefulExclamation(sentence):
  # Strip leading '!' characters; unlike indexing sentence[0], this also
  # handles empty strings and tweets made up only of '!'.
  return sentence.lstrip('!')

# Build the clean_tweet column from the raw tweet text
df['clean_tweet'] = [removeUnusefulExclamation(tweet) for tweet in df['tweet']]
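
As an illustration of the two cleaning steps, the sketch below strips leading exclamation marks and then deletes a set of symbols with `str.translate`. The `symbols` string and both function names are hypothetical, shortened stand-ins; the notebook's real `pattern` is defined later.

```python
# Hypothetical, shortened symbol set (an assumption for this example).
symbols = "\"#$%&'()*+,-./:;<=>?@_!"

def strip_symbols(sentence, chars=symbols):
    # str.maketrans('', '', chars) builds a table that deletes every character in `chars`.
    return sentence.translate(str.maketrans('', '', chars))

def strip_leading_exclamations(sentence):
    # Removes '!' characters from the front only; inner ones are untouched.
    return sentence.lstrip('!')

raw = "!!! RT @user: that's cold..."
no_bang = strip_leading_exclamations(raw)
print(repr(no_bang))
print(repr(strip_symbols(no_bang)))
```

Spaces are not in `symbols`, so word boundaries survive the deletion.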


## Replace emoticons

We replace emoticons with words that approximate what they express. The emoticons.txt file must be uploaded.

The dataframe also keeps a column in which the tweets retain their emoticons, because we think they may provide useful information in the future.

In [None]:
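dict_emoticons = {}
with open('emoticons.txt') as f:
    for linea in f:
      # Each line holds comma-separated fields: the emoticon comes first and its word third
      info = linea.split(",")
      emoticon_code = info[0]
      emoticon_word = info[2]
      # Drop only the trailing newline, so an unterminated last line keeps its final character
      dict_emoticons[emoticon_code] = emoticon_word.rstrip("\n")
print(dict_emoticons)

def replaceEmoticon(word):
  # Return the mapped word for a known emoticon, otherwise the token unchanged
  return dict_emoticons.get(word, word)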

def replaceUnknownEmoticon(word):
  m = re.match(r"\\[u][A-Za-z0-9]*",word)
  if m is not None:
    word = ""
  return word

{'😀': 'smile', '😁': 'smile', '😂': 'laugh', '😃': 'smile', '😄': 'smile', '😆': 'smile', '😉': 'wink', '😊': 'smile', '😒': 'unamused', '😕': 'confused', '😗': 'kiss', '😘': 'kiss', '😙': 'kiss', '😚': 'kiss', '😞': 'dissapointed', '😟': 'worried', '😠': 'angry', '😡': 'angry', '😢': 'sad', '😨': 'frightened', '😪': 'sleepy', '😫': 'tired', '😭': 'sad', '😮': 'surprised', '😯': 'surprised', '😱': 'frightened', '😲': 'astonished', '😳': 'flushed', '😴': 'sleepy', '😵': 'confused', '😶': 'quiet', '🤐': 'quiet', '🤒': 'ill', '🤔': 'thoughtful', '🤡': 'clown', '🤢': 'sucks', '🤣': 'laugh', '\U0001f92c': 'angry', '\U0001f92e': 'sucks', '\U0001f92f': 'astonished', '🍑': 'ass', '🍒': 'tits', '🍌': 'dick', '🍆': 'dick', '👉': 'finger', '👌': 'ok'}
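
To make the lookup concrete, here is a minimal sketch of dictionary-based replacement applied to a whole token list. The three-entry `demo_emoticons` dictionary and the `replace_emoticons` helper are illustrative assumptions; the notebook builds its dictionary from emoticons.txt.

```python
# Illustrative subset of the mapping (an assumption for this example).
demo_emoticons = {'😂': 'laugh', '😡': 'angry', '👌': 'ok'}

def replace_emoticons(tokens, mapping=demo_emoticons):
    # Known emoticons become their word; every other token is left untouched.
    return [mapping.get(tok, tok) for tok in tokens]

tokens = ['that', 'was', 'great', '😂', '👌']
print(replace_emoticons(tokens))

# How an unmapped symbol looks once escaped, which is the form that the
# regular expression in replaceUnknownEmoticon inspects.
print('☺'.encode('unicode-escape').decode('ASCII'))
```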


## Remove contractions
 
We define a dictionary of the most common contractions and use it to replace each contraction with its expanded form.

In [None]:
dict_contractions = {'aren\'t':'are not', 'can\'t':'can not', 'couldn\'t':'could not', 'didn\'t':'did not', 'don\'t':'do not', 'doesn\'t':'does not', 'hadn\'t':'had not',
                       'haven\'t':'have not', 'he\'s':'he is', 'he\'ll':'he will', 'he\'d':'he would', 'here\'s':'here is', 'i\'m':'i am', 'i\'ve':'i have', 'i\'ll':'i will',
                       'i\'d':'i would', 'isn\'t':'is not','it\'s':'it is', 'it\'ll':'it will', 'mustn\'t':'must not', 'she\'s':'she is', 'she\'ll':'she will', 'she\'d':'she would',
                       'shouldn\'t':'should not', 'that\'s':'that is', 'there\'s':'there is', 'they\'re':'they are', 'they\'ve':'they have', 'they\'ll':'they will',
                       'they\'d':'they had', 'wasn\'t':'was not', 'we\'re':'we are', 'we\'ve':'we have', 'we\'ll':'we will', 'we\'d':'we would', 'weren\'t':'were not', 'what\'s':'what is',
                       'where\'s':'where is', 'who\'s':'who is', 'who\'ll':'who will', 'won\'t':'will not', 'wouldn\'t':'would not', 'you\'re': 'you are', 'you\'ve': 'you have', 
                       'you\'ll':'you will', 'you\'d':'you would', 'y\'all': 'you all', 'could\'ve': 'could have', 'hasn\'t': 'has not', 'let\'s': 'let us'}

def remove_contractions(word):
  # Look up the expanded form; call only after is_remove_contractions(word) is True
  return dict_contractions[word]
  
def is_remove_contractions(word):
  # True when the token is a known contraction
  return word in dict_contractions
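
The sketch below shows how a contraction lookup changes the length of a token list, since one token expands into two. `demo_contractions` and `expand_contractions` are hypothetical names used only for this example.

```python
# Small illustrative subset of the contraction dictionary (an assumption).
demo_contractions = {"can't": "can not", "i'm": "i am", "won't": "will not"}

def expand_contractions(tokens, mapping=demo_contractions):
    expanded = []
    for tok in tokens:
        # An expansion holds several words, so extend the list instead of appending.
        expanded.extend(mapping.get(tok, tok).split(" "))
    return expanded

print(expand_contractions(["i'm", "sure", "you", "can't"]))
```

The lookup is case-sensitive, so tokens need to be lowercased before they are checked.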

## Executing preprocessing

We now run all the functions defined above, together with other preprocessing steps that did not need a separate function, such as converting text to lowercase.

In [None]:
# We create a new dataframe to save result in different columns
df_result = pd.DataFrame()

In [None]:
pattern = "\"#$%&'()*+, -./:;<=>?@[\\]^_`{|}~“”…»’!"
p.set_options(p.OPT.EMOJI, p.OPT.URL,p.OPT.HASHTAG, p.OPT.MENTION, p.OPT.SMILEY, p.OPT.ESCAPE_CHAR, p.OPT.RESERVED, p.OPT.NUMBER)
tknzr = TweetTokenizer(strip_handles=True, reduce_len=True)

###### EXECUTING PREPROCESSING WITHOUT EMOTICONS ######

# Each sentence of tweet
sentences = []
for tweet in df['clean_tweet']:
    result = tknzr.tokenize(tweet)
    list_token = []
    for word in result:
      # Change emoticon for text
      token = replaceEmoticon(word)
      token = replaceUnknownEmoticon(token.encode('unicode-escape').decode('ASCII'))

      # Tokens that were not blanked out are decoded back from their escaped form
      if token != "":
        token = token.encode('ASCII').decode('unicode-escape')

      # Clean URLs, hashtags, mentions and the other elements set in p.set_options
      token = p.clean(token)

      # Lowercase every token, including fully capitalized words
      token = token.lower()

      if is_remove_contractions(token): # Remove contractions
        token = remove_contractions(token)
        token = token.split(" ")
        list_token.append(token[0])
        list_token.append(token[1])
      else: # Clean unuseful data (",_,...)
        token = cleanUnusefulData(token)
        if token != "":
          list_token.append(token)
      
    sentences.append(list_token)

df_result['preprocessing_without_emoticons'] = sentences
df_result



Unnamed: 0,preprocessing_without_emoticons
0,"[as, a, woman, you, should, not, complain, abo..."
1,"[boy, dats, cold, tyga, dwn, bad, for, cuffin,..."
2,"[dawg, you, ever, fuck, a, bitch, and, she, st..."
3,"[she, look, like, a, tranny]"
4,"[the, shit, you, hear, about, me, might, be, t..."
...,...
24778,"[yous, a, muthaf, in, lie, right, his, tl, is,..."
24779,"[you, have, gone, and, broke, the, wrong, hear..."
24780,"[young, buck, wanna, eat, dat, nigguh, like, i..."
24781,"[youu, got, wild, bitches, tellin, you, lies]"


We add a second column that keeps emoticons and '!' tokens, since we have seen that VADER's sentiment analysis gives importance to them.

In [None]:
###### EXECUTING PREPROCESSING WITH EMOTICONS ######

pattern = "\"#$%&'()*+, -./:;<=>?@[\\]^_`{|}~“”…»’"
p.set_options(p.OPT.URL,p.OPT.HASHTAG, p.OPT.MENTION, p.OPT.SMILEY, p.OPT.ESCAPE_CHAR, p.OPT.RESERVED, p.OPT.NUMBER)

# Each sentence of tweet
sentences = []
for tweet in df['clean_tweet']:
    result = tknzr.tokenize(tweet)
    list_token = []
    for word in result:
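      # Clean URLs, hashtags and mentions; emojis are not stripped here because p.OPT.EMOJI is not set
      token = p.clean(word)
      token = replaceUnknownEmoticon(token.encode('unicode-escape').decode('ASCII'))

      # Tokens that were not blanked out are decoded back from their escaped form
      if token != "":
        token = token.encode('ASCII').decode('unicode-escape')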

      # Lowercase the token unless it is fully capitalized; single capital letters are lowercased too.
      if not token.isupper():
        token = token.lower()
      if token.isupper() and len(token)==1:
        token = token.lower()

      if is_remove_contractions(token): # Remove contractions
        token = remove_contractions(token)
        token = token.split(" ")
        list_token.append(token[0])
        list_token.append(token[1])
      else: # Clean unuseful data (",_,...)
        token = cleanUnusefulData(token)
        if token != "":
          list_token.append(token)
    
    sentences.append(list_token)

df_result['preprocessing_with_emoticons'] = sentences
df_result

Unnamed: 0,preprocessing_without_emoticons,preprocessing_with_emoticons
0,"[as, a, woman, you, should, not, complain, abo...","[as, a, woman, you, should, not, complain, abo..."
1,"[boy, dats, cold, tyga, dwn, bad, for, cuffin,...","[boy, dats, cold, tyga, dwn, bad, for, cuffin,..."
2,"[dawg, you, ever, fuck, a, bitch, and, she, st...","[dawg, !, !, !, you, ever, fuck, a, bitch, and..."
3,"[she, look, like, a, tranny]","[she, look, like, a, tranny]"
4,"[the, shit, you, hear, about, me, might, be, t...","[the, shit, you, hear, about, me, might, be, t..."
...,...,...
24778,"[yous, a, muthaf, in, lie, right, his, tl, is,...","[yous, a, muthaf, in, lie, right, !, his, TL, ..."
24779,"[you, have, gone, and, broke, the, wrong, hear...","[you, have, gone, and, broke, the, wrong, hear..."
24780,"[young, buck, wanna, eat, dat, nigguh, like, i...","[young, buck, wanna, eat, !, !, dat, nigguh, l..."
24781,"[youu, got, wild, bitches, tellin, you, lies]","[youu, got, wild, bitches, tellin, you, lies]"
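
The casing rule used for the second column can be isolated as a small function. This is a sketch of that rule only; `normalize_case` is a hypothetical name, not a function from the notebook.

```python
def normalize_case(token):
    # Mixed-case and lowercase words are lowercased.
    if not token.isupper():
        return token.lower()
    # A single capital letter (for example "I" or "A") is lowercased as well.
    if len(token) == 1:
        return token.lower()
    # Fully capitalized words of two or more letters are kept as they are.
    return token

print([normalize_case(t) for t in ["This", "is", "SO", "bad", "I", "!"]])
```

This matches the table above, where a token such as `TL` stays uppercase in the column with emoticons.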


## Correct Wrong Words

The following functions implement a spelling corrector, following the approach of Peter Norvig's well-known spelling corrector. It uses the word frequencies in the big.txt file to correct the words in the tweets, so the big.txt file must be uploaded.

In [None]:
def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
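
To see how the candidate search behaves, the sketch below runs the same one-edit logic on a tiny corpus. The corpus string, `counts`, and `correct` are assumptions made for this example, and the two-edit step is left out to keep it short.

```python
import re
from collections import Counter

# Tiny stand-in corpus (an assumption); the notebook reads big.txt instead.
corpus = "the cold day was cold and the tea was hot"
counts = Counter(re.findall(r'\w+', corpus.lower()))

def edits1(word):
    "All strings that are one edit away from `word`."
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    "The subset of `words` that appear in the corpus."
    return {w for w in words if w in counts}

def correct(word):
    # Prefer the word itself if known, then known words one edit away,
    # and fall back to the unchanged word.
    options = known([word]) or known(edits1(word)) or [word]
    # Ranking by raw count gives the same order as ranking by probability.
    return max(options, key=lambda w: counts[w])

print(correct('cld'))
print(correct('teh'))
print(correct('xyzzy'))
```

When several known words are one edit away, as with `teh` (`the` and `tea`), the more frequent one wins. This also explains why slang in the tweets can be rewritten into an unrelated dictionary word.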

In [None]:
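def is_emoji(s):
    # Count the occurrences of every known emoji in `s`;
    # return True only when the total is exactly one.
    count = 0
    for emoji in UNICODE_EMOJI:
        count += s.count(emoji)
        if count > 1:
            return False
    return bool(count)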



In [None]:
bad_words = []
with open('bad_words.txt') as f:
    for linea in f:
      # Drop only the trailing newline, so an unterminated last line keeps its final character
      bad_words.append(linea.rstrip("\n"))
print(bad_words)

['2g1c', '2 girls 1 cup', 'acrotomophilia', 'alabama hot pocket', 'alaskan pipeline', 'anal', 'anilingus', 'anus', 'apeshit', 'arsehole', 'ass', 'asshole', 'assmunch', 'auto erotic', 'autoerotic', 'babeland', 'baby batter', 'baby juice', 'ball gag', 'ball gravy', 'ball kicking', 'ball licking', 'ball sack', 'ball sucking', 'bangbros', 'bangbus', 'bareback', 'barely legal', 'barenaked', 'bastard', 'bastardo', 'bastinado', 'bbw', 'bdsm', 'beaner', 'beaners', 'beaver cleaver', 'beaver lips', 'beastiality', 'bestiality', 'big black', 'big breasts', 'big knockers', 'big tits', 'bimbos', 'birdlock', 'bitch', 'bitches', 'black cock', 'blonde action', 'blonde on blonde action', 'blowjob', 'blow job', 'blow your load', 'blue waffle', 'blumpkin', 'bollocks', 'bondage', 'boner', 'boob', 'boobs', 'booty call', 'brown showers', 'brunette action', 'bukkake', 'bulldyke', 'bullet vibe', 'bullshit', 'bung hole', 'bunghole', 'busty', 'butt', 'buttcheeks', 'butthole', 'camel toe', 'camgirl', 'camslut', '

⚠️⚠️ **HIGH EXECUTION TIME!** ⚠️⚠️

Using the following piece of code, we correct all the words in the tweets. However, the execution time is approximately 30 minutes. 

In [None]:
df_correct_words = pd.DataFrame(columns=['preprocessing_with_emoticons','preprocessing_without_emoticons'])

In [None]:
########## ⚠️ HIGH EXECUTION TIME ⚠️ ##########

# Set membership is much faster than scanning the bad_words list for every token
bad_words_set = set(bad_words)

listTweetsWithout = []
for tokens in df_result['preprocessing_without_emoticons']:
  # Listed bad words are kept as they are; every other token goes through the corrector
  listTweet = [tok if tok in bad_words_set else correction(str(tok)) for tok in tokens]
  listTweetsWithout.append(listTweet)

df_correct_words['preprocessing_without_emoticons'] = listTweetsWithout

In [None]:
########## ⚠️ HIGH EXECUTION TIME ⚠️ ##########
# Set membership is much faster than scanning the bad_words list for every token
bad_words_set = set(bad_words)

listTweetsWith = []
for tokens in df_result['preprocessing_with_emoticons']:
  listTweet = []
  for tok in tokens:
    if tok in bad_words_set or is_emoji(str(tok)) or tok == '!':
      # Bad words, emojis and '!' are kept exactly as they are
      listTweet.append(str(tok))
    else:
      listTweet.append(correction(str(tok)))
  listTweetsWith.append(listTweet)

df_correct_words['preprocessing_with_emoticons'] = listTweetsWith

In [None]:
df_correct_words
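
If many words recur across tweets, one possible way to reduce the running time is to compute the correction of each distinct token only once. The sketch below is not part of the original notebook: `slow_correction` is a hypothetical stand-in for `correction`, and the `calls` counter exists only to show the effect of the cache.

```python
from functools import lru_cache

calls = 0

def slow_correction(word):
    # Hypothetical stand-in for the notebook's `correction`; counts its invocations.
    global calls
    calls += 1
    return word.lower()

@lru_cache(maxsize=None)
def cached_correction(word):
    # Each distinct word is corrected once; repeats are served from the cache.
    return slow_correction(word)

tweets = [['you', 'got', 'lies'], ['you', 'got', 'wild'], ['you', 'lies']]
corrected = [[cached_correction(tok) for tok in tweet] for tweet in tweets]
print(corrected)
print(calls)  # one call per distinct token, not per token
```

How much this helps depends on how often tokens repeat in the real data, which has not been measured here.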

## Stop words

Stop words are very common words that carry little information and do not contribute to expressions of hatred, so we remove them.

In [None]:
english_stops = set(stopwords.words('english'))
clear_sent_emoticon = []
clear_sent_not_emoticon = []
for tweet in range(0,len(df_result['preprocessing_with_emoticons'])):
  clear_sent_emoticon.append([word for word in df_correct_words['preprocessing_with_emoticons'][tweet] if word not in english_stops]) 
  clear_sent_not_emoticon.append([word for word in df_correct_words['preprocessing_without_emoticons'][tweet] if word not in english_stops]) 

df_result['preprocessing_with_emoticons'] = clear_sent_emoticon
df_result['preprocessing_without_emoticons'] = clear_sent_not_emoticon
df_result



Unnamed: 0,preprocessing_without_emoticons,preprocessing_with_emoticons
0,"[woman, complain, cleaning, house, man, always...","[woman, complain, cleaning, house, man, always..."
1,"[boy, days, cold, tea, bad, coffin, dat, st, p...","[boy, days, cold, tea, bad, coffin, dat, st, p..."
2,"[dawn, ever, fuck, bitch, start, cry, confused...","[dawn, !, !, !, ever, fuck, bitch, start, cry,..."
3,"[look, like, tranny]","[look, like, tranny]"
4,"[shit, hear, might, true, might, baker, bitch,...","[shit, hear, might, true, might, baker, bitch,..."
...,...,...
24778,"[mutual, lie, right, trash, mine, bible, scrip...","[mutual, lie, right, !, trash, mine, bible, sc..."
24779,"[gone, broke, wrong, heart, baby, drove, redne...","[gone, broke, wrong, heart, baby, drove, redne..."
24780,"[young, buck, anna, eat, dat, nigh, like, aunt...","[young, buck, anna, eat, !, !, dat, nigh, like..."
24781,"[got, wild, bitches, telling, lies]","[got, wild, bitches, telling, lies]"
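
The filtering step amounts to a membership test per token, as in this minimal sketch. The five-word `demo_stops` set and the `drop_stop_words` helper are assumptions for the example; the notebook uses NLTK's English stop word list.

```python
# Small illustrative stop list (an assumption for this example).
demo_stops = {'you', 'a', 'the', 'and', 'as'}

def drop_stop_words(tokens, stops=demo_stops):
    # Keeping the stop words in a set makes each membership test fast.
    return [tok for tok in tokens if tok not in stops]

print(drop_stop_words(['you', 'got', 'wild', 'and', 'the', 'lies', '!']))
```

Tokens such as `!` are not stop words, so they pass through unchanged.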


## Lemmatize all terms

Lemmatization reduces the inflected forms of a word to a common base form (its lemma), so that they can be analyzed as a single term.

In [None]:
lemmatizer = WordNetLemmatizer()
clear_sent_emoticon = []
clear_sent_not_emoticon = []
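# Note: the loop reads from df_correct_words, so the output below still contains stop words;
# reading from df_result instead would keep the stop-word removal of the previous step.
for tweet in range(0,len(df_result['preprocessing_with_emoticons'])):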
  clear_sent_emoticon.append([lemmatizer.lemmatize(word) for word in df_correct_words['preprocessing_with_emoticons'][tweet]])
  clear_sent_not_emoticon.append([lemmatizer.lemmatize(word) for word in df_correct_words['preprocessing_without_emoticons'][tweet]])

df_result['preprocessing_with_emoticons'] = clear_sent_emoticon
df_result['preprocessing_without_emoticons'] = clear_sent_not_emoticon
df_result



Unnamed: 0,preprocessing_without_emoticons,preprocessing_with_emoticons
0,"[a, a, woman, you, should, not, complain, abou...","[a, a, woman, you, should, not, complain, abou..."
1,"[boy, day, cold, tea, down, bad, for, coffin, ...","[boy, day, cold, tea, down, bad, for, coffin, ..."
2,"[dawn, you, ever, fuck, a, bitch, and, she, st...","[dawn, !, !, !, you, ever, fuck, a, bitch, and..."
3,"[she, look, like, a, tranny]","[she, look, like, a, tranny]"
4,"[the, shit, you, hear, about, me, might, be, t...","[the, shit, you, hear, about, me, might, be, t..."
...,...,...
24778,"[you, a, mutual, in, lie, right, his, to, is, ...","[you, a, mutual, in, lie, right, !, his, of, i..."
24779,"[you, have, gone, and, broke, the, wrong, hear...","[you, have, gone, and, broke, the, wrong, hear..."
24780,"[young, buck, anna, eat, dat, nigh, like, i, a...","[young, buck, anna, eat, !, !, dat, nigh, like..."
24781,"[you, got, wild, bitch, telling, you, lie]","[you, got, wild, bitch, telling, you, lie]"
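
As a simplified picture of what lemmatization does to a token list, the sketch below uses a plain lookup table. It is a toy stand-in and not how `WordNetLemmatizer` works internally; `demo_lemmas` and `lemmatize_tokens` are hypothetical names.

```python
# Hypothetical lookup table mapping inflected noun forms to their lemmas.
demo_lemmas = {'lies': 'lie', 'days': 'day', 'women': 'woman'}

def lemmatize_tokens(tokens, table=demo_lemmas):
    # Tokens without an entry are returned unchanged.
    return [table.get(tok, tok) for tok in tokens]

print(lemmatize_tokens(['women', 'telling', 'lies', 'for', 'days']))
```

In the notebook output, verb forms such as `telling` also stay unchanged, because `WordNetLemmatizer.lemmatize` treats each word as a noun unless a part of speech is passed.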


# 3. Export the data

Finally, we save the preprocessed data to a CSV file so that it can be used more comfortably in another notebook.

In [None]:
df_result['class'] = df['class']
df_result.to_csv("preprocessed_data.csv")
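
One point to keep in mind when the file is read in another notebook: a CSV cell cannot hold a Python list, so each token list is stored as its string representation. The sketch below shows the round trip with the standard library only; the column names and rows are made up for the example. With pandas, passing `ast.literal_eval` through the `converters` argument of `read_csv` is one way to get the same effect.

```python
import ast
import csv
import io

# Made-up rows for the example: (id, token list, class).
rows = [(0, ['woman', 'complain', 'house'], 2), (1, ['boy', 'cold'], 1)]

# Write: a list cell is stored as its string representation.
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['id', 'tokens', 'class'])
writer.writerows(rows)

# Read back: the cell is now a plain string and has to be parsed into a list.
buffer.seek(0)
reader = csv.DictReader(buffer)
restored = [ast.literal_eval(row['tokens']) for row in reader]
print(restored)
```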