<a href="https://colab.research.google.com/github/realbluesnail/UNCC_DSBA6188/blob/main/DSBA6188_Preprocessing_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Initial Setup
- Import packages
- Mount your Google Drive (if you would like to read and save data files from your Google Drive)

In [6]:
import numpy as np
import pandas as pd
# import seaborn
# import matplotlib.pyplot as plt
# # from wordcloud import WordCloud ## don't need it this time

In [2]:
import nltk
import nltk.corpus


In [3]:
# Import several NLTK tokenizers (word, sentence, Treebank, word-punct, tweet, multi-word expression)
from nltk.tokenize import (word_tokenize,
                           sent_tokenize,
                           TreebankWordTokenizer,
                           wordpunct_tokenize,
                           TweetTokenizer,
                           MWETokenizer)
# Download the Punkt data that sent_tokenize and word_tokenize need to split text into sentences
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [4]:
# If you would like to save and read data files from your Google drive
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/
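
If the notebook may also be run outside Colab, the mount can be made optional. The following is a minimal sketch under that assumption: `get_data_dir` and its local fallback directory are hypothetical names introduced for illustration, and only `drive.mount` and the Drive folder path come from this notebook.

```python
from pathlib import Path

def get_data_dir(local_fallback: str = '.') -> Path:
    """Return the folder to read data files from.

    On Colab, mount Google Drive and use the "Colab Notebooks" folder.
    Elsewhere, fall back to a local directory (hypothetical default: the
    current working directory).
    """
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return Path(local_fallback)
    drive.mount('/content/drive/')
    return Path('/content/drive/MyDrive/Colab Notebooks')

data_dir = get_data_dir()
print(data_dir / 'tweeter_training.csv')
```

With this in place, the same file name can be joined onto `data_dir` in either environment instead of hard-coding the Drive path.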


# Take-home Assignment
Create your own preprocessing pipeline (from raw text to tokens).

### Things to consider

- What kinds of text are "relevant" to your use case, and what kinds are not? What should you keep and what should you remove?
- What techniques are you going to use to filter out the "irrelevant" text?
- What is the right order to apply these techniques?
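
To see why the order matters, here is a minimal sketch that uses only the standard library. The tweet text and the helper names `strip_urls` and `strip_punct` are made up for illustration. Stripping punctuation before removing URLs fuses the URL into one long word that the URL pattern can no longer match.

```python
import re
import string

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_urls(text):
    """Remove anything that looks like a URL."""
    return URL_PATTERN.sub('', text)

def strip_punct(text):
    """Delete every ASCII punctuation character."""
    return text.translate(PUNCT_TABLE)

# Hypothetical tweet, used only for this illustration
tweet = "Check this out http://twitpic.com/abc123 - so cool!!!"

# URLs first, then punctuation: the link disappears cleanly
urls_first = strip_punct(strip_urls(tweet)).split()

# Punctuation first: the URL loses its "://" and "." characters,
# so the URL pattern no longer recognizes it and the debris stays
punct_first = strip_urls(strip_punct(tweet)).split()

print(urls_first)
print(punct_first)
```

The second list still contains the leftover `httptwitpiccomabc123`, which is the kind of noise a well-ordered pipeline avoids.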

#### 1. Load the Twitter data and create a sample set

In [7]:
# [Optional step]: If you would like to take a look at more sample data from the Twitter dataset
# To load the example data set, note that you might need to change the file path to wherever you saved tweeter_training.csv.
# The data set is also available on Canvas for download.
tw_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/tweeter_training.csv', encoding='ISO-8859-1', header=None)
# Add column names
column_names = ['target', 'ids', 'date', 'flag', 'user', 'text']
tw_df.columns = column_names

In [8]:
# Examine the data frame
tw_df.info()
tw_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
 #   Column  Non-Null Count    Dtype 
---  ------  --------------    ----- 
 0   target  1600000 non-null  int64 
 1   ids     1600000 non-null  int64 
 2   date    1600000 non-null  object
 3   flag    1600000 non-null  object
 4   user    1600000 non-null  object
 5   text    1600000 non-null  object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


Unnamed: 0,target,ids,date,flag,user,text
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."


In [9]:
# Test Preprocessing and Tokenizers with a sample set
sample_tw = tw_df.sample(10).text.values
sample_tw

array(['@ddlovato http://twitpic.com/65r98 - u look so gorgeous grl!!! so mature and sophisticated ',
       'Break a leg tonight in London Brit! Wish I could be there ',
       '@rocker_chick22 yes! that stupid hooker - wow two stupid hookers in one day! [kelseys ;)!] &amp; amen at the tickets  no blink concert now ',
       'My breath fogged up the glass so I drew a face and I laughed ',
       '@xpianogirl what?! ooh  i think i commented in the past post..omg, my laptop is so SILLY! he sent me to the other post :S',
       '@elsua Thanks for keeping me up-to-date -- we were never told that an account was set up for @ffblog Had to find it in your tweets ',
       '@golfnovels yowza - that sounds like a part-ay. ',
       '@bootynbrainz awwwwwwwwwwwwe  you told me i never had a chance! LOL!!!',
       "@ameliesoleil sept. I have loads to do to get up to date. I'm setting weekly tasks. This week its form filling and a meeting ",
       'last monday of the school year '], dtype=object)

#### 2. Create the preprocessing pipeline

- Consider wrapping your preprocessing pipeline into a callable function
- Depending on your use case and the downstream functions that the preprocessing pipeline will be plugged into, design the input and output of your preprocessing function accordingly.

In the example below, both the input and the output of the preprocessing function are strings.
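
One possible way to keep both options open is a core function that returns a list of tokens plus a thin wrapper that joins them back into a string. The sketch below is illustrative only: `to_tokens` and `to_string` are hypothetical names, and the whitespace split is a stand-in for a real tokenizer such as `TweetTokenizer`.

```python
from typing import List

def to_tokens(text: str) -> List[str]:
    """Core step: string in, list of tokens out.

    A plain whitespace split stands in for a real tokenizer here.
    """
    return text.lower().split()

def to_string(text: str) -> str:
    """Wrapper: string in, string out, for consumers that expect text."""
    return ' '.join(to_tokens(text))

example = "Last Monday of   the school year"
print(to_tokens(example))
print(to_string(example))
```

A token list suits steps that count or filter individual tokens, while a single string suits downstream tools that run their own tokenizer.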

In [10]:

# Tweet tokenizer, regular expressions, and the punctuation list
import re
import string

from nltk.tokenize import TweetTokenizer

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

def preprocess_tweet(tweet):
    """Clean one raw tweet (a string) and return the remaining tokens joined by spaces."""
    my_tokenizer = TweetTokenizer()
    # Step 1: lowercase, so tokens match the lowercase stopword list
    tweet_lower = tweet.lower()
    # Step 2: remove URLs
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    tweet_wo_url = url_pattern.sub('', tweet_lower)
    # Step 3: remove Twitter handles (@username)
    handle_pattern = re.compile(r'@\S+')
    tweet_wo_handle = handle_pattern.sub('', tweet_wo_url)
    # Step 4: tokenize, then remove stopwords and punctuation
    tokens = my_tokenizer.tokenize(tweet_wo_handle)
    tokens_wo_stopword = [token for token in tokens if token not in stop_words]
    # `in string.punctuation` is a substring test on whole tokens, so single
    # punctuation marks are dropped while tokens such as '..' or ';)' are kept
    tokens_wo_punct = [token for token in tokens_wo_stopword if token not in string.punctuation]
    tweet_processed = ' '.join(tokens_wo_punct)
    return tweet_processed



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [14]:

# Apply the pipeline to every tweet in the sample
tweets_processed = [preprocess_tweet(tweet) for tweet in sample_tw]

tweets_processed


['u look gorgeous grl mature sophisticated',
 'break leg tonight london brit wish could',
 'yes stupid hooker wow two stupid hookers one day kelseys ;) amen tickets blink concert',
 'breath fogged glass drew face laughed',
 'ooh think commented past post .. omg laptop silly sent post',
 'thanks keeping up-to-date never told account set find tweets',
 'yowza sounds like part-ay',
 'awwwwwwwwwwwwe told never chance lol',
 "sept loads get date i'm setting weekly tasks week form filling meeting",
 'last monday school year']
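
Note that `..` and `;)` survive in the output above. The punctuation filter compares each whole token against `string.punctuation`, and neither of these multi-character tokens is a substring of it. Whether to keep them depends on the use case, since an emoticon can carry sentiment. If they should be removed, one possible stricter filter is sketched below; `is_all_punct` is a hypothetical helper, not part of the pipeline above.

```python
import string

PUNCT_CHARS = set(string.punctuation)

def is_all_punct(token):
    """True if the token is non-empty and every character is ASCII punctuation."""
    return bool(token) and all(ch in PUNCT_CHARS for ch in token)

# Tokens taken from the processed sample above
tokens = ['ooh', 'think', 'post', '..', 'omg', 'kelseys', ';)', 'up-to-date']
kept = [tok for tok in tokens if not is_all_punct(tok)]
print(kept)
```

Hyphenated words such as `up-to-date` are kept because they contain letters; only tokens made up entirely of punctuation are dropped.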