# Preprocessing for NLP Modeling

This notebook works with the four datasets that I lightly cleaned and combined in the EDA & Cleaning notebook:

1. Airline Tweet Sentiments dataset | https://data.world/crowdflower/airline-twitter-sentiment

2. Apple Tweet Sentiment dataset | https://data.world/crowdflower/apple-twitter-sentiment

3. Sentiment 140 dataset | https://www.kaggle.com/datasets/kazanova/sentiment140/

4. Tweet 4 Sentiment Analysis dataset | http://www.t4sa.it

## Imports

In [2]:
import boto3
import pandas as pd
import numpy as np

# To ignore SettingWithCopyWarning
pd.options.mode.chained_assignment = None

# nltk stuff
import nltk
from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer, SnowballStemmer

# Regex
import re

# for vizzies
import matplotlib.pyplot as plt
import seaborn as sns

## Load Data

In [3]:
# Load up data
df = pd.read_csv('../data/combined_data.csv', index_col='Unnamed: 0')

df.head()



Unnamed: 0,sentiment,tweet
0,1.0,@VirginAmerica What @dhepburn said.
1,2.0,@VirginAmerica plus you've added commercials t...
2,1.0,@VirginAmerica I didn't today... Must mean I n...
3,0.0,@VirginAmerica it's really aggressive to blast...
4,0.0,@VirginAmerica and it's a really big bad thing...


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2798402 entries, 0 to 2798400
Data columns (total 2 columns):
 #   Column     Dtype  
---  ------     -----  
 0   sentiment  float64
 1   tweet      object 
dtypes: float64(1), object(1)
memory usage: 64.1+ MB


In [5]:
df.sentiment.value_counts(dropna=False)

2.0    1174127
0.0     989447
1.0     634827
NaN          1
Name: sentiment, dtype: int64

In [6]:
df[df['sentiment'].isna()]

Unnamed: 0,sentiment,tweet
Maybe @Charter or @Apple #AntiDrama @FoxNews,NaN,NaN


In [7]:
df.dropna(inplace=True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2798401 entries, 0 to 2798400
Data columns (total 2 columns):
 #   Column     Dtype  
---  ------     -----  
 0   sentiment  float64
 1   tweet      object 
dtypes: float64(1), object(1)
memory usage: 64.1+ MB


In [9]:
df.sentiment.value_counts(dropna=False)

2.0    1174127
0.0     989447
1.0     634827
Name: sentiment, dtype: int64

Okay, let's get preprocessing.

## Build a Processing Function and Apply It

Let's build a function to clean up the content of each tweet, tokenize it, and filter out stopwords. (The stemmers imported above are not applied in this function.)

We will need a few tweet-specific steps in this function: remove user mentions (@'s), strip the hash symbol from hashtags and split them into separate words, and remove URLs. The tweet below will be our working example.

In [10]:
df.tweet.iloc[112]

'@VirginAmerica has getaway deals through May, from $59 one-way. Lots of cool cities http://t.co/QDlJHslOI5 #CheapFlights #FareCompare'
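
Before wrapping everything in a function, here is a rough sketch of the steps described above applied one at a time to this tweet, so it is easier to see what each regex contributes. It only uses `re`; `re.findall` stands in for the NLTK tokenizer, and `demo_stopwords` is a tiny hypothetical stand-in for the NLTK English stopword list, so treat it as an illustration rather than the exact pipeline.

```python
import re

tweet = ("@VirginAmerica has getaway deals through May, from $59 one-way. "
         "Lots of cool cities http://t.co/QDlJHslOI5 #CheapFlights #FareCompare")

# 1. Drop @mentions
no_mentions = re.sub(r"@\w+", "", tweet)

# 2. Strip the hash symbol and split CamelCase hashtags into words
split_hashtags = re.sub(
    r"#(\w+)",
    lambda m: " ".join(re.findall(r"[A-Z]?[a-z]+", m.group(1))),
    no_mentions,
)

# 3. Drop URLs
no_urls = re.sub(r"https?://\S+|www\.\S+", "", split_hashtags)

# 4. Lowercase
lowered = no_urls.lower()

# 5. Tokenize (assumption: re.findall as a stand-in for the NLTK tokenizer)
raw_tokens = re.findall(r"[a-zA-Z]+(?:'[a-z]+)?", lowered)

# 6. Remove stopwords (assumption: a hand-picked subset, not the full NLTK list)
demo_stopwords = {"has", "through", "from", "of"}
tokens = [t for t in raw_tokens if t not in demo_stopwords]

print(tokens)
```

Note that `$59` disappears at the tokenizing step, since the pattern only keeps alphabetic tokens.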

In [11]:
def tokenize_tweet(tweet, tokenizer, stopwords_list):
    # Remove @mentions from tweet
    tweet = re.sub(r'@\w+', '', tweet)
    
    # Clean hashtags by removing the hash symbol and separating words by capital letters
    tweet = re.sub(r'#(\w+)', lambda x: ' '.join(re.findall(r'[A-Z]?[a-z]+', x.group(1))), tweet)
    
    # Remove URL's
    tweet = re.sub(r'https?://\S+|www\.\S+', '', tweet)
    
    # Standardize case (lowercase the text)
    tweet = tweet.lower()
    
    # Tokenize text using RegEx
    tokens = tokenizer.tokenize(tweet)
    
    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in stopwords_list]
    
    # Return the preprocessed text
    return filtered_tokens
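
One caveat worth knowing about the hashtag step: the pattern `[A-Z]?[a-z]+` only keeps pieces that contain lowercase letters, so an all-caps hashtag is dropped entirely. The sketch below compares it with a hypothetical alternative pattern (`ACRONYM_AWARE_PATTERN` is my own illustration, not something used in this notebook) that also keeps runs of capitals.

```python
import re

ORIGINAL_PATTERN = r"[A-Z]?[a-z]+"
# Hypothetical alternative: a run of capitals not followed by a lowercase
# letter is kept as its own word (e.g. an acronym)
ACRONYM_AWARE_PATTERN = r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+"

def split_hashtag(tag_body, pattern):
    """Split the text after '#' into space-separated words."""
    return " ".join(re.findall(pattern, tag_body))

examples = ("CheapFlights", "NYC", "NYCFlights")
original = {tag: split_hashtag(tag, ORIGINAL_PATTERN) for tag in examples}
alternative = {tag: split_hashtag(tag, ACRONYM_AWARE_PATTERN) for tag in examples}

for tag in examples:
    print(f"#{tag}: original={original[tag]!r} alternative={alternative[tag]!r}")
```

Whether the acronyms are worth keeping depends on the modeling goal; the function in this notebook uses the original pattern throughout.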

In [12]:
tokenizer = RegexpTokenizer(r"[a-zA-Z]+(?:'[a-z]+)?")
stopwords_list = set(stopwords.words('english'))  # a set keeps the per-token membership check fast

# Test function
tokenize_tweet(df.tweet.iloc[112], tokenizer, stopwords_list)

['getaway',
 'deals',
 'may',
 'one',
 'way',
 'lots',
 'cool',
 'cities',
 'cheap',
 'flights',
 'fare',
 'compare']

Perfect. Now let's cross our fingers this function won't break our kernel when we apply it to the entire dataframe...

In [13]:
df['tokenized'] = df.tweet.apply(lambda x: tokenize_tweet(x, tokenizer, stopwords_list))

df.head()

Unnamed: 0,sentiment,tweet,tokenized
0,1.0,@VirginAmerica What @dhepburn said.,[said]
1,2.0,@VirginAmerica plus you've added commercials t...,"[plus, added, commercials, experience, tacky]"
2,1.0,@VirginAmerica I didn't today... Must mean I n...,"[today, must, mean, need, take, another, trip]"
3,0.0,@VirginAmerica it's really aggressive to blast...,"[really, aggressive, blast, obnoxious, enterta..."
4,0.0,@VirginAmerica and it's a really big bad thing...,"[really, big, bad, thing]"


## Re-Join Tokenized Words and Check for Duplicates / NaN's

Next we need to rejoin our tokens, as our vectorizer is expecting a single string of tokens.

We will also check for duplicates and NaNs and drop any we find.

In [14]:
# Joins tokens into single string
df['text'] = df.tokenized.apply(lambda x: " ".join(x))

In [15]:
# Keep only our target and text data for saving down
# (.copy() gives an independent frame, so later in-place edits do not act on a slice)
df = df[['sentiment', 'text']].copy()
df.head()

Unnamed: 0,sentiment,text
0,1.0,said
1,2.0,plus added commercials experience tacky
2,1.0,today must mean need take another trip
3,0.0,really aggressive blast obnoxious entertainmen...
4,0.0,really big bad thing


In [16]:
df.shape

(2798401, 2)

In [17]:
df.duplicated(subset='text').value_counts()

False    2294858
True      503543
dtype: int64

Okay, definitely need to drop those duplicates.
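
Since `drop_duplicates(subset='text')` keeps the first row for each text no matter what its label is, it can be worth checking first whether any duplicated text carries more than one sentiment. A minimal sketch on a toy frame (hypothetical rows, not taken from this dataset):

```python
import pandas as pd

# Toy data: "flight delayed" appears with two different labels
toy = pd.DataFrame({
    "sentiment": [0.0, 2.0, 1.0, 1.0],
    "text": ["flight delayed", "flight delayed", "thanks", "thanks"],
})

# Count distinct labels per text; more than one means the duplicates disagree
labels_per_text = toy.groupby("text")["sentiment"].nunique()
conflicting_texts = labels_per_text[labels_per_text > 1].index.tolist()

print(conflicting_texts)
```

How to handle conflicting labels (keep the first, drop them all, or take a majority vote) is a judgment call; this notebook simply keeps the first occurrence.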

In [18]:
df.drop_duplicates(subset='text', inplace=True)

In [19]:
df.shape

(2294858, 2)

In [20]:
df.text.isna().value_counts()

False    2294858
Name: text, dtype: int64

Okay, so no NaNs, apparently. Let's save this down and get modeling!
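
One thing `isna()` will not catch here: a tweet made up only of mentions, URLs, or stopwords ends up as an empty string rather than NaN, and pandas reads an empty CSV field back as NaN by default. The sketch below shows the round trip on a toy frame (hypothetical rows, with an in-memory buffer standing in for the CSV file).

```python
import io
import pandas as pd

# Toy data: the first row mimics a tweet that was fully stripped by preprocessing
toy = pd.DataFrame({"sentiment": [1.0, 2.0], "text": ["", "great flight"]})

nan_before = int(toy["text"].isna().sum())
empty_before = int((toy["text"].str.strip() == "").sum())

# Write to CSV and read it back, as the modeling notebook would
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)
reloaded = pd.read_csv(buffer, index_col=0)

nan_after = int(reloaded["text"].isna().sum())

print(nan_before, empty_before, nan_after)
```

Filtering out empty strings before saving, or calling `dropna` after loading, would both take care of this.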

## Save Down

In [21]:
df.to_csv('../data/processed_data.csv')

We'll also want to save down a processing function to use on unseen tweets later when we deploy our model. Let's redefine one with the final join step and save it.

In [22]:
def preprocess_tweet(tweet, tokenizer, stopwords_list):
    # Remove @mentions from tweet
    tweet = re.sub(r'@\w+', '', tweet)
    
    # Clean hashtags by removing the hash symbol and separating words by capital letters
    tweet = re.sub(r'#(\w+)', lambda x: ' '.join(re.findall(r'[A-Z]?[a-z]+', x.group(1))), tweet)
    
    # Remove URL's
    tweet = re.sub(r'https?://\S+|www\.\S+', '', tweet)
    
    # Standardize case (lowercase the text)
    tweet = tweet.lower()
    
    # Tokenize text using RegEx
    tokens = tokenizer.tokenize(tweet)
    
    # Remove stopwords
    filtered_tokens = [word for word in tokens if word not in stopwords_list]
    
    # Return the preprocessed text
    return " ".join(filtered_tokens)

In [26]:
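# dill (aliased as pickle here) can serialize a function defined in the notebook by value;
# the standard library pickle would only store a reference to its name
import dill as pickle

with open('preprocess_tweet.pkl', 'wb') as file:
    pickle.dump(preprocess_tweet, file)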