# NLP Basics

This notebook covers the preprocessing steps we apply to tweets.

## Import the libraries

We will use **nltk**. If nltk is not installed, run `pip install nltk` in the Anaconda Prompt. Once nltk is installed, download the required data:

```python
import nltk
nltk.download('stopwords')          # this will download the stopwords
nltk.download('twitter_samples')    # this will download labelled twitter samples for sentiment analysis
```

In [1]:
import nltk
import numpy as np
from nltk.corpus import twitter_samples

## Get the data

The `twitter_samples` corpus contains a subset of 5000 positive tweets, a subset of 5000 negative tweets, and a full set of 10000 tweets.
We will use the 5000 positive and 5000 negative tweets.

In [3]:
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

In [5]:
all_positive_tweets[:5]

['#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)',
 '@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!',
 '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!',
 '@97sides CONGRATS :)',
 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days']

In [6]:
all_negative_tweets[:5]

['hopeless for tmr :(',
 "Everything in the kids section of IKEA is so cute. Shame I'm nearly 19 in 2 months :(",
 '@Hegelbon That heart sliding into the waste basket. :(',
 '“@ketchBurning: I hate Japanese call him "bani" :( :(”\n\nMe too',
 'Dang starting next week I have "work" :(']

## Preprocessing

In [19]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
import re
import string

In [8]:
stopwords_english = stopwords.words('english')

In [10]:
len(stopwords_english)

179

In [13]:
tweet = all_positive_tweets[0]

In [15]:
re.sub(r'#', '', tweet)

'FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)'

In [24]:
def process_tweet(tweet):
    """
    Process a tweet into a list of cleaned, stemmed tokens.
    input:
        tweet: string -> a string containing a tweet
    output:
        tweets_clean: list of strings -> the processed tokens of the tweet
    """
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    
    # remove the hashtags -> remove the # sign from the word
    tweet = re.sub(r'#','',tweet)
    
    # remove the stock market tickers like $AMAZON
    tweet = re.sub(r'\$\w*', '', tweet)
    
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+','', tweet)
    
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    
    # remove numbers
    tweet = re.sub(r'\d+', '', tweet)   # raw string avoids an invalid-escape warning
    
    # tokenize the tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    
    tweet_tokens = tokenizer.tokenize(tweet)
    
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and   # removes stop words
            word not in string.punctuation):    # removes punctuation
            stem_word = stemmer.stem(word)
            tweets_clean.append(stem_word)
            
    return tweets_clean
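
The regex substitutions in `process_tweet` can be tried on their own, without nltk. The sketch below applies the same patterns in the same order to a made-up tweet (the `sample` string and the `steps` list are hypothetical, introduced only for illustration) and prints the intermediate result after each step.

```python
import re

# Hypothetical tweet, made up for illustration (not from twitter_samples).
sample = "RT @user: Loving #NLP! $AMZN up 3% today https://example.com/x"

# The same substitutions as process_tweet, in the same order.
steps = [
    ("hashtag sign", r'#'),
    ("stock ticker", r'\$\w*'),
    ("retweet marker", r'^RT[\s]+'),
    ("hyperlink", r'https?:\/\/.*[\r\n]*'),
    ("digits", r'\d+'),
]

cleaned = sample
for name, pattern in steps:
    cleaned = re.sub(pattern, '', cleaned)
    print(f"{name:>15}: {cleaned!r}")

# The hyperlink pattern is greedy: it also removes any text that follows
# the URL on the same line.
trailing = re.sub(r'https?:\/\/.*[\r\n]*', '', "see https://example.com/x for details")
print(repr(trailing))
```

Handles such as `@user` are untouched here because `process_tweet` leaves them to `TweetTokenizer(strip_handles=True)`. Note that the hyperlink pattern discards the rest of the line after a URL, which may or may not be desirable depending on where links appear in the tweets.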

In [25]:
for x in all_positive_tweets[:5]:
    print(process_tweet(x))

['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']
['hey', 'jame', 'odd', ':/', 'pleas', 'call', 'contact', 'centr', 'abl', 'assist', ':)', 'mani', 'thank']
['listen', 'last', 'night', ':)', 'bleed', 'amaz', 'track', 'scotland']
['congrat', ':)']
['yeaaah', 'yipppi', 'accnt', 'verifi', 'rqst', 'succeed', 'got', 'blue', 'tick', 'mark', 'fb', 'profil', ':)', 'day']
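
Emoticons such as `:)` survive the punctuation filter because `string.punctuation` is a single string, so `word not in string.punctuation` is a substring test rather than a per-character check. The short sketch below illustrates the difference against a set of characters; the token list is made up for illustration.

```python
import string

# Hypothetical tokens chosen to show the substring behaviour.
tokens = [":)", ":(", "!", "()", "..."]

# `in` on a str checks for a contiguous substring of string.punctuation.
substring_hits = {tok: tok in string.punctuation for tok in tokens}
print(substring_hits)

# A set of characters only matches single punctuation characters.
punct_chars = set(string.punctuation)
set_hits = {tok: tok in punct_chars for tok in tokens}
print(set_hits)
```

With the substring test, a multi-character token is dropped only if it happens to appear contiguously inside `string.punctuation` (as `()` does), so the set-based check may be the more predictable choice.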


In [26]:
for x in all_negative_tweets[:5]:
    print(process_tweet(x))

['hopeless', 'tmr', ':(']
['everyth', 'kid', 'section', 'ikea', 'cute', 'shame', "i'm", 'nearli', 'month', ':(']
['heart', 'slide', 'wast', 'basket', ':(']
['“', 'hate', 'japanes', 'call', 'bani', ':(', ':(', '”']
['dang', 'start', 'next', 'week', 'work', ':(']
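
The fourth tweet keeps the curly quotes `“` and `”` because `string.punctuation` only lists ASCII characters. One possible way to drop them, sketched below, is to test single-character tokens against their Unicode category. The helper name `is_unicode_punct` is hypothetical and not part of nltk, and the check is meant as a complement to the ASCII filter, since symbols such as `$` and `+` are not in the Unicode punctuation categories.

```python
import unicodedata

def is_unicode_punct(token):
    """Return True if the token is a single Unicode punctuation character."""
    return len(token) == 1 and unicodedata.category(token).startswith("P")

# Tokens copied from the fourth negative tweet's output above.
tokens = ['“', 'hate', 'japanes', 'call', 'bani', ':(', ':(', '”']

# Multi-character tokens such as ':(' are kept; single quote marks are dropped.
filtered = [tok for tok in tokens if not is_unicode_punct(tok)]
print(filtered)
```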


## Generate vocabularies

In [44]:
corpus = ["I love my country", "I hate EDM", "I love to do math", "I am very bad at history"]

In [43]:
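def create_vocabulary(corpus):
    """
    Build the list of unique processed tokens in a corpus,
    in the order in which they first appear.
    """
    vocabulary = []
    for x in corpus:
        for word in process_tweet(x):
            if word not in vocabulary:   # keep only the first occurrence
                vocabulary.append(word)
    return vocabulary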

In [46]:
vocab = create_vocabulary(corpus)

vocab

['love', 'countri', 'hate', 'edm', 'math', 'bad', 'histori']

In [47]:
len(vocab)

7

Now let's create the vocabulary for our tweets.

In [51]:
tweet_corpus = all_positive_tweets + all_negative_tweets

In [52]:
len(tweet_corpus)

10000

In [53]:
tweet_vocab = create_vocabulary(tweet_corpus)

In [54]:
len(tweet_vocab)

9905
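
`create_vocabulary` checks `word not in vocabulary` against a list, which is a linear scan for every token. A minimal sketch of an alternative that keeps the first-seen order but uses a dict for lookups is shown below. The names `create_vocabulary_indexed`, `simple_tokenize`, and the `tokenize` parameter are hypothetical; in this notebook `process_tweet` could be passed as `tokenize`, while the example uses a plain whitespace splitter so that it runs without nltk.

```python
def create_vocabulary_indexed(corpus, tokenize):
    """Map each distinct token to the index at which it was first seen."""
    word_to_index = {}
    for text in corpus:
        for word in tokenize(text):
            if word not in word_to_index:        # dict lookup, not a list scan
                word_to_index[word] = len(word_to_index)
    return word_to_index


def simple_tokenize(text):
    """Stand-in tokenizer for this example: lowercase and split on whitespace."""
    return text.lower().split()


# Hypothetical toy corpus, made up for illustration.
toy_corpus = ["I love my country", "I hate EDM", "I love to do math"]
word_to_index = create_vocabulary_indexed(toy_corpus, simple_tokenize)

# Dicts preserve insertion order, so the keys form the ordered vocabulary.
toy_vocab = list(word_to_index)
print(toy_vocab)
```

Returning the word-to-index mapping rather than a bare list can also be convenient later, for example when tokens need to be converted to integer ids.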