# 2019 Canadian Election tweets
# OSEMN Step 2: Scrub
# Cleanup of Sentiment 140 dataset
# Correcting records for consistency

This notebook covers part of Step 2 (Scrub) of the OSEMN methodology: the cleanup of the Sentiment 140 dataset.

Cleanup plan (stage 1, correction for consistency):

1. Parse dates
2. Replace HTML character codes
3. Replace unicode characters
4. Remove erratic text
5. Remove links from tweets
6. Remove tweets with erratic text length
7. Parse hashtags from tweets
8. Parse user handles from tweets
9. Apply text preprocessor
10. Remove non-English unicode characters

## Import dependencies

In [1]:
import numpy as np
import pandas as pd
import re
import os
import sys
from time import time

In [2]:
sys.path.append('../../src')
from nlp_utils import preprocessor

In [3]:
data_dir = '../../data/sentiment140/'
os.listdir(data_dir)

['testdata.manual.2009.06.14.csv',
 'training.1600000.processed.noemoticon.csv',
 'sentiment140_train_nodup.csv',
 'sentiment140_train_cleaned.csv']

## Load Sentiment 140 dataset

In [4]:
t = time()
train_df = pd.read_csv(data_dir + 'training.1600000.processed.noemoticon.csv', 
                       encoding="ISO-8859-1", header=None)
elapsed = time() - t
train_df.columns = ['sentiment', 'ids', 'date', 'query', 'user', 'text']
print("----- DataFrame loaded"
      "\nin {0:.2f} seconds".format(elapsed) +
      "\nwith {0:,} rows\nand {1:,} columns"
      .format(train_df.shape[0], train_df.shape[1]) +
      "\n-- Column names:\n", train_df.columns)

----- DataFrame loaded
in 4.67 seconds
with 1,600,000 rows
and 6 columns
-- Column names:
 Index(['sentiment', 'ids', 'date', 'query', 'user', 'text'], dtype='object')


## Parse dates

In [5]:
t = time()
train_df['date'] = pd.to_datetime(train_df['date'])
elapsed = time() - t
print("Date was parsed. Took {0:,.2f} seconds ({1:,.2f} minutes)".format(elapsed, elapsed / 60))



Date was parsed. Took 367.43 seconds (6.12 minutes)
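
The six minutes reported above are likely spent parsing each string without a known format. The following is a minimal sketch of a faster variant, not the method used in this notebook. It assumes the `date` strings follow the layout `Mon Apr 06 22:19:45 PDT 2009` and that the zone abbreviation is always `PDT`; the sample values are illustrative.

```python
import pandas as pd

# illustrative sample in the assumed Sentiment 140 date layout
sample = pd.Series(["Mon Apr 06 22:19:45 PDT 2009",
                    "Tue Jun 16 08:40:49 PDT 2009"])

# drop the zone abbreviation (assumed constant), then parse with an explicit format
no_tz = sample.str.replace(" PDT", "", regex=False)
parsed = pd.to_datetime(no_tz, format="%a %b %d %H:%M:%S %Y")
print(parsed)
```

The resulting timestamps are timezone-naive, so they should be read as local Pacific time under the assumption above.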


## Replace HTML character codes
 

### Replace '<3' with 'love':

In [6]:
train_df['text'] = train_df['text'].str.replace("&lt;3", 'love')
print("Done!")

Done!


### HTML character codes:

The following HTML character codes will be replaced with symbols:
* `&quot;`
* `&amp;`
* `&lt;`
* `&gt;`

In [7]:
# replacing HTML character codes with their ASCII equivalents
train_df['text'] = train_df['text'].str.replace("&quot;", '"')
train_df['text'] = train_df['text'].str.replace("&amp;", '&')
train_df['text'] = train_df['text'].str.replace("&lt;", '<')
train_df['text'] = train_df['text'].str.replace("&gt;", '>')
print("HTML character codes were replaced with their ASCII equivalent.")

HTML character codes were replaced with their ASCII equivalent.
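
As an alternative to chaining replacements, the standard library's `html.unescape` decodes all HTML character references in one pass. The sketch below uses made-up sample strings and is not what the cell above ran. One difference worth noting: the chained version replaces `&amp;` before `&lt;`, so a doubly escaped `&amp;lt;` ends up as `<`, whereas `html.unescape` stops at `&lt;`.

```python
import html
import pandas as pd

# made-up sample tweets containing HTML character codes
sample = pd.Series(['&quot;Tom &amp; Jerry&quot; &lt;3', 'a &gt; b'])

# decode every HTML character reference in a single pass
unescaped = sample.apply(html.unescape)
print(unescaped.tolist())
```

With this approach the `<3` to `love` replacement would have to run after unescaping and match the literal `<3`.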


## Replace unicode characters

In [8]:
# the sequence below is U+FFFD (the Unicode replacement character) as it appears when
# its UTF-8 bytes are read as ISO-8859-1; replace it with an apostrophe
train_df['text'] = train_df['text'].str.replace("ï¿½", "'")
print("Unicode characters were replaced with their ASCII equivalent.")

Unicode characters were replaced with their ASCII equivalent.
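
The three-character sequence replaced above can be reproduced directly. The sketch below shows that encoding U+FFFD as UTF-8 and decoding the bytes as ISO-8859-1 (the encoding passed to `read_csv`) yields exactly that sequence. This suggests the original character was already lost before the file was written, so mapping it to an apostrophe is a heuristic rather than a recovery.

```python
# U+FFFD is the Unicode replacement character
replacement_char = "\ufffd"

# its UTF-8 bytes (EF BF BD), misread as ISO-8859-1, give three separate characters
mojibake = replacement_char.encode("utf-8").decode("ISO-8859-1")
print(mojibake, [hex(ord(c)) for c in mojibake])
```

The middle character is `¿` (0xBF), which is also a common continuation byte in other misread UTF-8 sequences.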


## Remove erratic text

In [9]:
mask1 = train_df['text'].str.contains('¿')
train_df.loc[mask1, 'text'].head()

244842    @michichan ã?ã??ã?ã?§ã?ã?ã ã?ã?¿ã?¾ã?ã...
245862    Tháº¿ mÃ  chÆ°a báº¯t ÄÆ°á»£c con cÃ¡ nÃ o, t...
245941    @13th ÑÑÐ¾ Ñ?Ð¾Ð²Ñ?ÐµÐ¼ Ð¿Ð»Ð¾ÑÐ¾?  Ð° blue...
245949    má»t wa', thÃ´i mai lÃ m tiáº¿p, cÃ²n 3 chá»¯...
246160    hic, mÃ£i má»i cÃ i xong cÃ¡i giáº£i thuáº­t ...
Name: text, dtype: object

In [10]:
# drop the rows flagged above; copy so that later column assignments do not act on a slice
train_df = train_df[~mask1].copy()
print("{0:,} records remaining in the DataFrame.".format(len(train_df)))

1,599,445 records remaining in the DataFrame.


## Remove links

In [11]:
mask = train_df['text'].str.contains('http')
print("{0:,} records contain 'http' in 'text'.\n".format(len(train_df[mask])))
for i in np.arange(10): print(train_df.loc[mask, 'text'].iloc[i])

70,135 records contain 'http' in 'text'.

@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D
@MissXu sorry! bed time came here (GMT+1)   http://is.gd/fNge
Broadband plan 'a massive broken promise' http://tinyurl.com/dcuc33 via www.diigo.com/~tautao Still waiting for broadband we are 
Why won't you show my location?!   http://twitpic.com/2y2es
Strider is a sick little puppy  http://apps.facebook.com/dogbook/profile/view/5248435
 Body Of Missing Northern Calif. Girl Found: Police have found the remains of a missing Northern California girl .. http://tr.im/imji
Emily will be glad when Mommy is done training at her new job. She misses her.  http://apps.facebook.com/dogbook/profile/view/6176014
Crazy wind today = no birding  http://ff.im/1XTTi
Check out my mug  http://www.erika-obscura.blogspot.com
http://twitpic.com/2y2wr - according to my bro, our new puppy had a poo fight and was covered in poop  (picture stolen from him)


In [12]:
t = time()
train_df['text'] = train_df['text'].apply(lambda text: re.sub(r'http\S+', '', text))
elapsed = time() - t
print("Links were removed from the column 'text' of the DataFrame! Took {0:,.2f} seconds ({1:,.2f} minutes)"
      .format(elapsed, elapsed / 60))

Links were removed from the column 'text' of the DataFrame! Took 3.31 seconds (0.06 minutes)
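
The same pattern can also be applied with the vectorized `Series.str.replace` instead of `apply` with a lambda. This is a minimal sketch on two sample strings, one of them taken from the output above; it is offered as an equivalent formulation, not as a measured speed-up.

```python
import pandas as pd

sample = pd.Series(["Check out my mug  http://www.erika-obscura.blogspot.com",
                    "no link here"])

# remove 'http' and everything up to the next whitespace character
cleaned = sample.str.replace(r"http\S+", "", regex=True)
print(cleaned.tolist())
```

Either way, the pattern only matches text starting with `http`, so bare links such as `www.diigo.com/~tautao` stay in the text, and so does the whitespace that surrounded a removed link.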


In [13]:
# subset all records that contain 'http' in the tweet 'text'
mask = train_df['text'].str.contains('http')
print("{0:,} records contain 'http' in 'text'.\n".format(len(train_df[mask])))

# print at most 10 of the matching records
len_to_print = min(10, mask.sum())

for i in np.arange(len_to_print): print(train_df.loc[mask, 'text'].iloc[i])

17 records contain 'http' in 'text'.

might have to make some changes to ruby twitter .. it doesn't include headers coming back ..so no API count without a serperate http call 
"The Gmail gadget does not support the "Always use  grr doofes igoogle  will aber kein http nutzen........
sofranel.eu updated : new url management. I hope it will improve Google referencing...  But again some little problems with CSS  http ...
@PatchouliW 1&1 internet hosting sucks cos they only allow you to carry light weight apps and only internal apps so no sl http requests 
Wtf is http streaming??? I still want flash 10 for iPhone  ITS SUNNY WOOOOO..... I need skittles -.- taste the rainbow!!!
@holytshirt gutted, they block http flickr at work 
Can't get VLC http interface connected ;( now i've got to fysically move to my computer for play/pause  anyone tips? S not firewall afaik
(@JenniferEllenM)Went to see Bob Dylan last night, was amazin'  Going to work soon. I was put on till 13 for my first ever shift!

## Remove tweets with text > 150 characters

In [14]:
t = time()
# add a new column with length of strings in 'text' to the DataFrame with generic tweets
train_df['text_len'] = train_df['text'].str.len()
elapsed = time() - t
print("Column 'text_len' was added to the DataFrame. Took {0:,.2f} seconds ({1:,.2f} minutes)"
      .format(elapsed, elapsed / 60))

Column 'text_len' was added to the DataFrame. Took 0.64 seconds (0.01 minutes)


In [15]:
max_len = 150
mask1 = train_df['text_len'] > max_len
print("Tweets longer than {0} characters".format(max_len))
train_df.loc[mask1, ['user', 'text', 'text_len']]

Tweets longer than 150 characters


| | user | text | text_len |
|---|---|---|---|
| 245571 | candiceccl | ä»²æä¸èª²adverse possessionï¼?ä½ä¹å?çc... | 166 |
| 248243 | moriator | @anhhung cÃ i moto4lin rá»i anh XÃ i ÄÆ°á»£... | 156 |
| 258460 | B6ah | @CrEaTiVe_B Ø§Ø®ØªØ¨Ø§Ø±Ù .. final freshman l... | 181 |
| 258511 | B6ah | @CrEaTiVe_B Ø¹ÙØ¯Ù ÙÙÙØ² 8Ø§ÙØµØ¨Ø­ Ù Ø... | 154 |
| 265700 | tukata | @pammanista à¸¨à¸²à¸¥à¸à¸£à¸°à¸ à¸¹à¸¡à¸´ na ... | 151 |
| ... | ... | ... | ... |
| 1583033 | im_nlfb | @traquannet Chá»? tÃ­ nhÃ©, mÃ¬nh cÃ i tweetde... | 170 |
| 1583052 | 5ummer | @manubkk @bkkdude Tks for sharing ka. But if ... | 151 |
| 1586631 | LaMiaVitaBella | @RawkerChick Currently obsessed with...WATERME... | 151 |
| 1587593 | kuturak | Ð?Ð°Ñ?ÑÑÐ¾ÐµÐ½Ð¸ÐµÑÐ¾ Ð¼Ð¸ Ð´Ð½ÐµÑ? Ðµ Ð² Ð... | 167 |


In [16]:
train_df.loc[mask1, 'text_len'].hist(bins=30)

[Figure: histogram of `text_len` for tweets longer than 150 characters, 30 bins]

In [17]:
train_df = train_df[~mask1].drop('text_len', axis=1)
print("{0:,} records remaining in the DataFrame.".format(len(train_df)))

1,599,306 records remaining in the DataFrame.


## Parse hashtags from tweets

In [18]:
t = time()
train_df['hashtags'] = train_df['text'].apply(lambda text: " ".join(re.findall(r'#\w+', text)))
elapsed = time() - t
print("Hashtags have been extracted into a new column 'hashtags' of the DataFrame! "
      "Took {0:,.2f} seconds ({1:,.2f} minutes)".format(elapsed, elapsed / 60))

Hashtags have been extracted into a new column 'hashtags' of the DataFrame! Took 2.77 seconds (0.05 minutes)


## Parse user handles from tweets

In [19]:
t = time()
train_df['handles'] = train_df['text'].apply(lambda text: " ".join(re.findall(r'@\w+', text)))
elapsed = time() - t
print("User @ handles have been extracted into a new column 'handles' of the DataFrame! "
      "Took {0:,.2f} seconds ({1:,.2f} minutes)".format(elapsed, elapsed / 60))

User @ handles have been extracted into a new column 'handles' of the DataFrame! Took 3.35 seconds (0.06 minutes)
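
Both extraction steps can also be written with the vectorized string accessor: `str.findall` returns a list of matches per row and `str.join` concatenates each list. The sketch below uses made-up sample tweets and the same two patterns as the cells above.

```python
import pandas as pd

# made-up sample tweets
sample = pd.Series(["@switchfoot loving the new album #music #rock",
                    "no tags here",
                    "mail me at name@example.com"])

# find all matches per row, then join each list with a single space
hashtags = sample.str.findall(r"#\w+").str.join(" ")
handles = sample.str.findall(r"@\w+").str.join(" ")
print(hashtags.tolist())
print(handles.tolist())
```

Rows without matches yield an empty string. The third sample shows a limitation shared with the `apply` version: `@\w+` also picks up the domain part of an e-mail address.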


## Apply text preprocessor
The preprocessor defined in `nlp_utils` does the following:
* Converts all words to lower case
* Removes all non-word characters
* Moves emoticons to the end of the string and removes the "nose" symbol '-'
* Removes all non-English unicode characters (not from the ASCII character set)
* Strips leading and trailing whitespace and replaces multiple spaces with a single one

In [20]:
import re


def preprocessor(text):
    """
    Text preprocessor.
    :param text: an input string
    :return: the cleaned string
    """
    # remove everything between '<' and '>', can be used to remove HTML tags
    # text = re.sub('<[^>]*>', '', text)
    # temporarily store all emoticons
    emoticons = re.findall(r'[:;=](?:-)?[)(D|\(/)\(\)]', text)
    # lower case, replace runs of non-word characters with a space,
    # append emoticons to the end of the string, remove the "nose" '-'
    text = (re.sub(r'[\W]+', ' ', text.lower()) + ' '.join(emoticons).replace('-', ''))
    # remove non-English unicode characters (not from the ASCII character set)
    text = re.sub(r'[^\x00-\x7F]', '', text)
    # strip leading and trailing whitespace, replace multiple spaces with a single one
    text = re.sub(r'\s+', ' ', text).strip()
    return text


In [21]:
#TODO does not detect ':\', ':/', leaves 'p' and 'd' for ':D' and ':P'

In [22]:
preprocessor(r"     :\ This :)    is :| *%^        :( <?> ä½ä¹ :-P a test :-)!     ")

'this is p a test :) :| :( :)'
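
The TODO above can be checked by testing the emoticon pattern against individual emoticons. The sketch below does that and compares it with a hypothetical extended pattern (`extended` is an illustration, not part of `nlp_utils`) that also accepts `P`, `p` and a backslash. As far as this check goes, `:/` is already matched by the original character class, while `:\` and `:-P` are not.

```python
import re

original = r'[:;=](?:-)?[)(D|\(/)\(\)]'  # pattern used in preprocessor
extended = r'[:;=]-?[)(DPp|/\\]'         # hypothetical: adds 'P', 'p' and a backslash

samples = [":)", ":-(", ":D", ":/", ":\\", ":-P"]
original_hits = [bool(re.fullmatch(original, s)) for s in samples]
extended_hits = [bool(re.fullmatch(extended, s)) for s in samples]
for s, o, e in zip(samples, original_hits, extended_hits):
    print(repr(s), o, e)
```

Widening the pattern alone would not remove the stray `d` and `p`: the text is lowercased and only non-word characters are stripped, so the letter of `:D` or `:P` stays in the text unless the matched emoticons are removed from it first.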

In [23]:
mask1 = train_df['text'].str.contains('[^\x00-\x7F]')
train_df.loc[mask1, 'text']

6483       Ï '??k  **pouty face** Shitty day out in Bost...
175326     Oh joy its gong to be a long weekebd... Yipee ...
240454     faceyourmanga.com áá±áá¬áá¹á¸áá°á...
240464     @Buou å¦å èå å¤±è¯¯ä¸è¶ åæ?¥å?¶æ²¡æ...
240578     nÎ±o curti Î± musicÎ± novÎ± dÎ± @fresnorock nÎ±o 
                                 ...                        
1599682    @perequintana ara sÃ­ que ets tot un "pirata" ...
1599911    Oh my god. Cookie dough frijj. I just spaffed ...
1599933    Lmao @seizuresalad that was cute and awesome  ...
1599980    @myheartandmind jo jen by nemuselo zrovna tÃ© ...
1599996    TheWDB.com - Very cool to hear old Walt interv...
Name: text, Length: 9795, dtype: object

In [24]:
t = time()
train_df['text'] = train_df['text'].apply(preprocessor)
elapsed = time() - t
print("Text preprocessor applied to corpus, took {0:,.2f} seconds ({1:,.2f} minutes)"
      .format(elapsed, elapsed / 60))

Text preprocessor applied to corpus, took 32.84 seconds (0.55 minutes)


In [25]:
mask1 = train_df['text'].str.contains('[^\x00-\x7F]')
train_df.loc[mask1, 'text']

Series([], Name: text, dtype: object)

## Remove extra whitespace

In [26]:
mask1 = train_df['text'].str.contains('  ')
train_df.loc[mask1, 'text']

Series([], Name: text, dtype: object)

## Save results to a .csv file

In [27]:
save_path = data_dir + 'sentiment140_train_cleaned.csv'
t = time()
train_df.to_csv(save_path, index=False)
elapsed = time() - t
print("DataFrame saved to file\n{0}\ntook {1:,.2f} seconds ({2:,.2f} minutes)"
      .format(save_path, elapsed, elapsed / 60))


DataFrame saved to file
../../data/sentiment140/sentiment140_train_cleaned.csv
took 14.77 seconds (0.25 minutes)
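
One thing to keep in mind when loading the saved file: empty strings are written as empty fields, and `pd.read_csv` turns empty fields into `NaN` by default. The `hashtags` and `handles` columns are empty for tweets without tags, and `text` may be empty for tweets that consisted only of a link or non-ASCII characters. The sketch below uses a small made-up DataFrame and an in-memory buffer instead of the real file to show the effect and one way to avoid it.

```python
import io
import pandas as pd

# made-up frame with empty strings, written to an in-memory buffer
df = pd.DataFrame({"text": ["good morning", ""], "hashtags": ["", "#fail"]})
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# default read: empty fields come back as NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps them as empty strings
buffer.seek(0)
strings_kept = pd.read_csv(buffer, keep_default_na=False)
print(default_read.isna().sum().to_dict())
print(strings_kept["text"].tolist())
```

Whether to keep empty strings or drop those rows depends on the next stage of the cleanup; the sketch only shows the round-trip behaviour.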
