# Sentiment Classification: Preprocessing Part 1
Date: 2021-02-16
Tags: 'NLP','Data','Python'
Author: Vincent
<!--eofm-->

In this series of posts we're going to build a text classification model that predicts whether a comment expresses a negative or positive stance on COVID-19 vaccination. This short first installment will deal with the way in which we digest the raw comments into something a bit more workable before doing any machine-learn-y things.

### Cleaning up the comments

The data we have on hand are roughly 9,000 comments scraped from sources such as YouTube, Reddit and Twitter. Each comment was first labeled as either 1 or 0, expressing a positive or negative stance respectively. Every comment was then independently labeled by at least one other annotator, as we will see. Let's start up `pandas`, our data handling package of choice, and inspect some rows!


In [32]:
import pandas as pd

# These comments are pretty long
pd.set_option('display.max_colwidth', None)

# Technically a tsv rather than a csv, but pandas doesn't mind
df = pd.read_csv('vacc_train_data.tsv', sep='\t', names=['target','comment'])

df.head(10)

| | target | comment |
|---|---|---|
| 0 | 0/-1 | It is easier to fool a million people than it is to convince a million people that they have been fooled. - Mark Twain |
| 1 | 0/0 | NATURAL IMMUNITY protected us since evolution. Do not exist anymore? |
| 2 | 0/-1 | NATURAL IMMUNITY protected us since evolution. Do not exist anymore? ? ? No one talks about it, Why? ? |
| 3 | 1/-1 | The bigest sideffect of vaccines is fewer dead children That is savage |
| 4 | 0/-1 | 90% of people that get vaccinated don't get the virus Wooow what's in the vaccine then? I'm very pro vaccination, but in my opinion Covid 19 vaccine is just sweetened water. |
| 5 | 1/1 | 95.6% effective against the original strain and 85.6% effective against the variant. Excellent news. Every positive news during this pandemic is to be welcomed (no pun intended) |
| 6 | 0/0 | Appears Safe... yeh think I will pass. |
| 7 | 1/1 | Both NHS workers have a history of serious allergies and carry adrenaline pens around with them. So 2 people had a reaction out of thousands and they were already prone to them. I will still be getting the jab when it is offered to me.Thank you! |
| 8 | 0/0 | COVID arm is a new rash-like side effect appearing among some people who've received Moderna's COVID-19 vaccine. |
| 9 | 1/1 | Doctors don't know anything about vaccines?!? What is she fucking talking about?? That's like saying that Mechanics don't know anything about carburators! The fact that she said that implies that she knows more about vaccines than actual doctors. That her 20-30 minutes of "research" on google finding the answers she wanted is in some way a better education on vaccines than years in college studying in the field of medicine. Wow. |
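
The `target` column appears to pack the labels from the different annotators into one string, separated by `/`. As a minimal sketch, such a string could be split into one integer per annotator. The helper `split_labels` is hypothetical and not part of the original notebook, and what a value of `-1` signifies is not covered in this post.

```python
def split_labels(target):
    # "0/-1" -> [0, -1]: one integer per annotator, however many there are
    return [int(label) for label in target.split("/")]

print(split_labels("0/-1"))
print(split_labels("1/1"))
```

On the actual dataframe, `df['target'].str.split('/')` would do the same splitting for a whole column at once.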


The goal of our preprocessing steps here is to homogenize the text data a bit. Our first model will only consider which words occur in a given comment, disregarding the order in which they were written. Punctuation is therefore irrelevant, so we will simply strip it, and we will also convert everything to lower case. Website URLs also seem reasonable to remove.
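
To make the "order doesn't matter" idea concrete, here is a small sketch using only the standard library. The function `bag_of_words` and the two example sentences are made up for illustration; the model in the later posts may well build its word counts with a different tool.

```python
from collections import Counter

def bag_of_words(text):
    # Split on whitespace and count each word; word order is discarded
    return Counter(text.split())

a = bag_of_words("the vaccine is safe and the vaccine works")
b = bag_of_words("the vaccine works and the vaccine is safe")

print(a == b)        # same words, different order
print(a["vaccine"])  # how often "vaccine" occurs
```

Both sentences end up with identical counts. This is also why case and punctuation matter: without cleaning, `Vaccine`, `vaccine` and `vaccine.` would be counted as three different words.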

### Emojis

Emojis, however, are special characters that are of interest. Consider the following three fictitious comments:

 1. `Vaccine = ☠`
 2. `vaccine 😀` 
 3. `💉🤮🤮`
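
To see what is at stake, here is a minimal sketch of a naive cleaner that simply drops every non-ASCII character, applied to the three comments above. The helper name `strip_non_ascii` is made up for this illustration.

```python
import re

def strip_non_ascii(text):
    # Delete every character outside the ASCII range, emojis included
    return re.sub(r"[^\x00-\x7F]+", "", text).strip()

for comment in ["Vaccine = ☠", "vaccine 😀", "💉🤮🤮"]:
    print(repr(strip_non_ascii(comment)))
```

The first two comments lose the only part that carries a stance, and the third one ends up as an empty string.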

Stripping the emojis from these comments removes pretty important information.
Luckily for us, all emojis have descriptions, and we can translate them into text using the `demoji` package! We can now write the text cleaning function. We will make use of some regular expressions, so just roll with the punches if this looks weird.

In [30]:
import re
import demoji
# One-time download of the emoji codes (needed by older demoji releases)
# demoji.download_codes()

def textify_emojis(text):
    # Returns a dictionary with 'emoji' : 'description' pairs
    emojis = demoji.findall(text)

    # We have to slightly modify the descriptions
    # by replacing all special characters with dashes
    # and then surrounding them with ":" and whitespace
    for emoji, desc in emojis.items():
        desc = ' :' + re.sub(r"[^0-9a-zA-Z]+", "-", desc) + ': '
        # Plain string replacement: an emoji such as the keycap "*"
        # could otherwise be misread as a regex pattern
        text = text.replace(emoji, desc)

    return text

# Function that returns nice and clean text
def clean_text(text):
    # Make all text lowercase
    text = text.lower()

    # Remove links (raw string, so "\S" and "\." reach the regex engine as-is)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)

    # Replace newline with space
    text = re.sub('\n', ' ', text)

    # Remove all ":" and "-"-characters first, we will 
    # use these to represent emojis
    text = re.sub(r"[:\-]", '', text)

    # Replace emojis with text descriptions
    text = textify_emojis(text)

    # Remove most non-alphanumerical characters
    text = re.sub(r"[^0-9a-zA-Z%:\- ]+", "", text)

    # Get rid of unnecessary whitespace
    text = ' '.join(text.split())

    # Done!
    return text    

text = "💉🤮🤮"
print(text, '--->', clean_text(text))

💉🤮🤮 ---> :syringe: :face-vomiting: :face-vomiting:
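
As an aside, the description-to-token step inside `textify_emojis` can be tried on its own, without `demoji`. The sketch below applies the same regular expression to a few hand-written descriptions. The helper `to_token` and the example descriptions are assumptions for illustration, not actual `demoji` output.

```python
import re

def to_token(description):
    # Collapse every run of non-alphanumeric characters into a single dash,
    # then wrap the result in colons and pad it with spaces
    return " :" + re.sub(r"[^0-9a-zA-Z]+", "-", description) + ": "

print(to_token("face vomiting"))
print(to_token("rolling on the floor laughing"))
print(to_token("flag: United States"))
```

The surrounding spaces keep a token separate from its neighbours, and the colons and dashes survive the final character filter in `clean_text`. That is why all other colons and dashes are removed from the text before the emojis are translated.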


Nice! Now we apply this to the whole dataset.

In [33]:
df['comment'] = df['comment'].apply(clean_text)

df.head(10)

| | target | comment |
|---|---|---|
| 0 | 0/-1 | it is easier to fool a million people than it is to convince a million people that they have been fooled mark twain |
| 1 | 0/0 | natural immunity protected us since evolution do not exist anymore |
| 2 | 0/-1 | natural immunity protected us since evolution do not exist anymore no one talks about it why |
| 3 | 1/-1 | the bigest sideffect of vaccines is fewer dead children that is savage |
| 4 | 0/-1 | 90% of people that get vaccinated dont get the virus wooow whats in the vaccine then im very pro vaccination but in my opinion covid 19 vaccine is just sweetened water |
| 5 | 1/1 | 956% effective against the original strain and 856% effective against the variant excellent news every positive news during this pandemic is to be welcomed no pun intended |
| 6 | 0/0 | appears safe yeh think i will pass |
| 7 | 1/1 | both nhs workers have a history of serious allergies and carry adrenaline pens around with them so 2 people had a reaction out of thousands and they were already prone to them i will still be getting the jab when it is offered to methank you |
| 8 | 0/0 | covid arm is a new rashlike side effect appearing among some people whove received modernas covid19 vaccine |
| 9 | 1/1 | doctors dont know anything about vaccines what is she fucking talking about thats like saying that mechanics dont know anything about carburators the fact that she said that implies that she knows more about vaccines than actual doctors that her 2030 minutes of research on google finding the answers she wanted is in some way a better education on vaccines than years in college studying in the field of medicine wow |


Okay, the percentages look a bit wonky (95.6% has become 956%), and none of these ten rows contains an encoded emoji. Looking at the raw data, row 36 (index 35, since pandas counts from zero) contains a comment with an emoji.

In [41]:
print(df.loc[35,'comment'])

trust the government :rolling-on-the-floor-laughing:
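
The wonky percentages come from the character filter in `clean_text`, which removes the decimal point along with all other punctuation. One possible fix, sketched here as an assumption rather than what the later posts actually do, is to keep a dot only when it sits between two digits. The function `clean_keep_decimals` is a simplified, hypothetical stand-in for `clean_text` (no URL or emoji handling).

```python
import re

def clean_keep_decimals(text):
    text = text.lower()
    # Keep letters, digits, "%", spaces and, for now, every "."
    text = re.sub(r"[^0-9a-z%. ]+", "", text)
    # Drop each dot that does not have a digit on both sides
    text = re.sub(r"(?<!\d)\.|\.(?!\d)", "", text)
    # Get rid of unnecessary whitespace
    return " ".join(text.split())

print(clean_keep_decimals("95.6% effective against the variant. Excellent news."))
```

Sentence-ending dots are still removed, while `95.6%` keeps its decimal point.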


We might do something about the percentages later, but that's all for now!