# Text Preprocessing Basics

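This notebook walks through common preprocessing steps for tweets: cleaning with regular expressions, tokenization, stopword and punctuation removal, stemming, and vocabulary construction.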

## Import the libraries

We will use **nltk**. If it is not installed, run `pip install nltk` in your virtual environment. Once nltk is installed, download the required resources:

```python
import nltk
nltk.download('stopwords')          # this will download the stopwords
nltk.download('twitter_samples')    # this will download labelled twitter samples for sentiment analysis
nltk.download('wordnet')            # this will download the wordnet for lemmatization
```

In [23]:
import nltk
nltk.download('stopwords')
nltk.download('twitter_samples')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to C:\Users\Sourav
[nltk_data]     Karmakar\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package twitter_samples to C:\Users\Sourav
[nltk_data]     Karmakar\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Sourav
[nltk_data]     Karmakar\AppData\Roaming\nltk_data...


True

In [9]:
import numpy as np
from nltk.corpus import twitter_samples

In [10]:
twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

## Get the data

The `twitter_samples` corpus contains two labelled subsets, 5000 positive tweets and 5000 negative tweets, for 10000 labelled tweets in total.
We use both labelled subsets here.

In [11]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [12]:
all_positive_tweets[:5]

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days']

In [13]:
all_negative_tweets[:5]

['hopeless for tmr :(',
 "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(",
 '@Hegelbon That heart sliding into the waste basket. :(',
 '“@ketchBurning: I hate Japanese call him "bani" :( :(”\n\nMe too',
 'Dang starting next week I have "work" :(']

## Preprocessing

In [14]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
import re
import string

In [15]:
stopwords_english = stopwords.words('english')

In [16]:
len(stopwords_english)

198

In [17]:
stopwords_english[:20]

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been']

In [19]:
tweet = all_positive_tweets[0]

tweet

'#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)'

In [21]:
re.sub(r'@', '', tweet)

'#FollowFriday France_Inte PKuchly57 Milipol_Paris for being top engaged members in my community this week :)'
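The substitution above deletes only the `@` character, so the usernames stay in the text as ordinary words. A minimal sketch of removing the whole handle instead, assuming a handle is an `@` followed by word characters (the pattern `@\w+\s*` is an illustrative choice, not part of the original notebook):

```python
import re

tweet = ('#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top '
         'engaged members in my community this week :)')

# Assumption: a handle is "@" plus letters, digits or underscores.
# The trailing \s* also consumes the whitespace that follows the handle.
without_handles = re.sub(r'@\w+\s*', '', tweet)

print(without_handles)
```

Later in this notebook the same effect comes from `TweetTokenizer(strip_handles=True)`, so a separate regex step is optional.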

### Difference between Stemming and Lemmatization

In [28]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["studies", "studying", "changing", "caring", "fairness"]

stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word, pos='v') for word in words]

print("Stemmed words: ", stemmed_words)

print("Lemmatized words: ", lemmatized_words)

Stemmed words:  ['studi', 'studi', 'chang', 'care', 'fair']
Lemmatized words:  ['study', 'study', 'change', 'care', 'fairness']
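The output above shows the practical difference: a stem such as `studi` need not be a real word, because stemming strips suffixes by rule, whereas lemmatization maps each word to a dictionary form such as `study`. The toy sketch below illustrates the two ideas with plain Python. It is neither the Porter algorithm nor WordNet; the suffix list and the lookup table are made up for this example.

```python
# Toy illustration only: NOT the Porter algorithm and NOT WordNet.
TOY_SUFFIXES = ("ness", "ing", "ies", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical mini-dictionary standing in for a real lexical database.
TOY_LEMMAS = {
    "studies": "study",
    "studying": "study",
    "changing": "change",
    "caring": "care",
}

def toy_lemmatize(word):
    """Look the word up; unknown words are returned unchanged."""
    return TOY_LEMMAS.get(word, word)

words = ["studies", "studying", "changing", "caring", "fairness"]
toy_stems = [toy_stem(w) for w in words]
toy_lemmas = [toy_lemmatize(w) for w in words]

print("Toy stems:  ", toy_stems)
print("Toy lemmas: ", toy_lemmas)
```

Rule-based stripping is cheap but produces fragments such as `car` for `caring`, while the lookup returns real words at the cost of needing a dictionary.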


In [29]:
def process_tweet(tweet):
    """
    Clean, tokenize, and stem a single tweet.
    input:
        tweet: string -> the raw text of one tweet
    output:
        tweets_clean: list of strings -> the stemmed tokens that remain after cleaning
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    
    # remove the hashtags -> remove the # sign from the word
    tweet = re.sub(r'#','',tweet)
    
    # remove the stock market tickers like $AMAZON
    tweet = re.sub(r'\$\w*', '', tweet)
    
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+','', tweet)
    
    # remove hyperlinks (the greedy .* also drops any text after the URL on the same line)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    
    # remove numbers
    tweet = re.sub(r"\d+", "", tweet)

    # remove simple emoticons like :) :( :D :/ :-/ :-(
    tweet = re.sub(r'[:;=8][-]?[)(/DPp]', '', tweet)
    
    # tokenize the tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    
    tweet_tokens = tokenizer.tokenize(tweet)
    
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and   # removes stopwords
            word not in string.punctuation):    # removes ASCII punctuation tokens
            stem_word = stemmer.stem(word)
            tweets_clean.append(stem_word)
            
    return tweets_clean
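To see what each regular expression in `process_tweet` does, the sketch below applies the same patterns, in the same order, to a made-up tweet and prints the text after every step. The sample string is an assumption chosen to trigger each pattern; tokenization, stopword removal and stemming are left out so that the block needs only the standard library.

```python
import re

# Made-up tweet, used only to trace the substitutions.
sample = "RT @user: Loving #NLP :) check https://example.com now $AMZN 2024"

# (description, pattern) pairs in the same order as in process_tweet
steps = [
    ("hash signs", r'#'),
    ("tickers", r'\$\w*'),
    ("leading RT", r'^RT[\s]+'),
    ("hyperlinks", r'https?:\/\/.*[\r\n]*'),
    ("numbers", r'\d+'),
    ("emoticons", r'[:;=8][-]?[)(/DPp]'),
]

text = sample
for name, pattern in steps:
    text = re.sub(pattern, '', text)
    print(f"{name:>11}: {text!r}")
```

Note that the word `now` disappears at the hyperlink step, because `.*` matches everything up to the end of the line. If text after a URL should survive, a pattern such as `https?://\S+`, which stops at the first whitespace, is one possible alternative.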

In [31]:
all_positive_tweets[:5]

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days']

In [30]:
for x in all_positive_tweets[:5]:
    print(process_tweet(x))

['followfriday', 'top', 'engag', 'member', 'commun', 'week']
['hey', 'jame', 'odd', 'pleas', 'call', 'contact', 'centr', 'abl', 'assist', 'mani', 'thank']
['listen', 'last', 'night', 'bleed', 'amaz', 'track', 'scotland']
['congrat']
['yeaaah', 'yipppi', 'accnt', 'verifi', 'rqst', 'succeed', 'got', 'blue', 'tick', 'mark', 'fb', 'profil', 'day']


In [33]:
all_negative_tweets[:5]

['hopeless for tmr :(',
 "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(",
 '@Hegelbon That heart sliding into the waste basket. :(',
 '“@ketchBurning: I hate Japanese call him "bani" :( :(”\n\nMe too',
 'Dang starting next week I have "work" :(']

In [32]:
for x in all_negative_tweets[:5]:
    print(process_tweet(x))

['hopeless', 'tmr']
['everyth', 'kid', 'section', 'ikea', 'cute', 'shame', 'nearli', 'month']
['heart', 'slide', 'wast', 'basket']
['“', 'hate', 'japanes', 'call', 'bani', '”']
['dang', 'start', 'next', 'week', 'work']
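The curly quotes `“` and `”` survive in the fourth result because `string.punctuation` contains only ASCII characters, and because `word not in string.punctuation` is a substring test, so a multi-character token such as `...` is kept as well. A minimal sketch of a stricter check based on Unicode categories follows; the helper name `is_punctuation` is hypothetical and not part of the original notebook.

```python
import string
import unicodedata

def is_punctuation(token):
    """True if every character belongs to a Unicode punctuation category (P*)."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

tokens = ['“', 'hate', 'japanes', 'call', 'bani', '”', '...', '…']

# Filter used in process_tweet: a substring test against ASCII punctuation
ascii_filtered = [t for t in tokens if t not in string.punctuation]

# Stricter filter: drop tokens made up entirely of punctuation characters
unicode_filtered = [t for t in tokens if not is_punctuation(t)]

print(ascii_filtered)
print(unicode_filtered)
```

Characters such as `$` or `+` belong to Unicode symbol categories (`S*`) rather than punctuation, so this helper would keep them unless the check is extended.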


## Generate vocabularies

In [34]:
corpus = ["I love my country :)", "I hate EDM", "I love to do math", "I am very bad at history :("]

In [35]:
def create_vocabulary(corpus):
    """Return the unique processed tokens of `corpus`, in order of first appearance."""
    vocabulary = []
    for x in corpus:
        for word in process_tweet(x):
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary
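The `word not in vocabulary` test scans a list, so each lookup gets slower as the vocabulary grows. The sketch below keeps the same first-appearance order but tracks seen words in a set. The function name `create_vocabulary_fast` and its `tokenize` parameter are assumptions made for this example; `str.split` stands in for `process_tweet` so that the block runs without nltk.

```python
def create_vocabulary_fast(corpus, tokenize):
    """Collect unique tokens in order of first appearance.

    `tokenize` is any callable that maps a text to a list of tokens.
    """
    seen = set()          # constant-time membership checks on average
    vocabulary = []       # preserves the order of first appearance
    for text in corpus:
        for word in tokenize(text):
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary

# Stand-in corpus and tokenizer, for illustration only
toy_corpus = ["i love math", "i hate math", "love history"]
toy_vocab = create_vocabulary_fast(toy_corpus, str.split)

print(toy_vocab)
```

In the notebook the call would be `create_vocabulary_fast(corpus, process_tweet)`, which should return the same list as `create_vocabulary(corpus)`.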

In [36]:
vocab = create_vocabulary(corpus)

vocab

['love', 'countri', 'hate', 'edm', 'math', 'bad', 'histori']

In [37]:
len(vocab)

7

Now let's build the vocabulary for the full tweet corpus:

In [38]:
tweet_corpus = all_positive_tweets + all_negative_tweets

In [39]:
len(tweet_corpus)

10000

In [40]:
tweet_vocab = create_vocabulary(tweet_corpus)

In [41]:
len(tweet_vocab)

9870

In [42]:
tweet_vocab

['followfriday',
 'top',
 'engag',
 'member',
 'commun',
 'week',
 'hey',
 'jame',
 'odd',
 'pleas',
 'call',
 'contact',
 'centr',
 'abl',
 'assist',
 'mani',
 'thank',
 'listen',
 'last',
 'night',
 'bleed',
 'amaz',
 'track',
 'scotland',
 'congrat',
 'yeaaah',
 'yipppi',
 'accnt',
 'verifi',
 'rqst',
 'succeed',
 'got',
 'blue',
 'tick',
 'mark',
 'fb',
 'profil',
 'day',
 'one',
 'irresist',
 'flipkartfashionfriday',
 'like',
 'keep',
 'love',
 'custom',
 'wait',
 'long',
 'hope',
 'enjoy',
 'happi',
 'friday',
 'lwwf',
 'second',
 'thought',
 '’',
 'enough',
 'time',
 'dd',
 'new',
 'short',
 'enter',
 'system',
 'sheep',
 'must',
 'buy',
 'jgh',
 'go',
 'bayan',
 'bye',
 'act',
 'mischiev',
 'etl',
 'layer',
 'in-hous',
 'wareh',
 'app',
 'katamari',
 'well',
 '…',
 'name',
 'impli',
 'influenc',
 'big',
 '...',
 'juici',
 'selfi',
 'follow',
 'perfect',
 'alreadi',
 'know',
 "what'",
 'great',
 'opportun',
 'junior',
 'triathlet',
 'age',
 'gatorad',
 'seri',
 'get',
 'entri',
