# Project: Sentiment Classification
- Build a model that determines whether a tweet is positive or negative

### Step 1: Import the libraries

In [11]:
import nltk
import string
from nltk.tag import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer
from nltk import classify
from nltk.corpus import stopwords
from nltk import NaiveBayesClassifier
from random import shuffle


### Step 2: Download the sample tweets
- Execute the following cell

In [12]:
nltk.download('twitter_samples')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     /home/adel/nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/adel/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /home/adel/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/adel/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /home/adel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/adel/nltk_data...
[nltk_data]   Unzipping corpora/omw-1.4.zip.


True

### Step 3: The tweets
- Get the positive and negative tweets.
    - HINT: You access the positive tweets by: **nltk.corpus.twitter_samples.strings('positive_tweets.json')**
    - HINT: Similarly for the negative tweets.
- Notice: There are also tweets with no sentiment; we will ignore them in this project
- Check a few tweets

In [13]:
positive_tweets = nltk.corpus.twitter_samples.strings('positive_tweets.json')
negative_tweets = nltk.corpus.twitter_samples.strings('negative_tweets.json')

In [14]:
positive_tweets[0]

'#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)'

### Step 4: Tokenize the tweets
- You get the tokenized tweets as follows:
    - **nltk.corpus.twitter_samples.tokenized('positive_tweets.json')**
    - Similarly for **negative_tweets**
- Why tokenize?
    - To make processing easier
- Check a few tweets (tokenized)

In [15]:
positive_tweets = nltk.corpus.twitter_samples.tokenized('positive_tweets.json')
negative_tweets = nltk.corpus.twitter_samples.tokenized('negative_tweets.json')

In [16]:
positive_tweets[0]

['#FollowFriday',
 '@France_Inte',
 '@PKuchly57',
 '@Milipol_Paris',
 'for',
 'being',
 'top',
 'engaged',
 'members',
 'in',
 'my',
 'community',
 'this',
 'week',
 ':)']
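
To see why tokenization makes processing easier, the sketch below compares plain whitespace splitting with a small regular-expression tokenizer. The pattern and the name `simple_tokenize` are hypothetical and much simpler than the tokenizer NLTK applies to this corpus; they only illustrate how punctuation, emoticons, and links end up as separate tokens.

```python
import re

# Hypothetical, simplified pattern (an assumption for illustration only).
TOKEN_PATTERN = re.compile(
    r"""
    https?://\S+           # hyperlinks
    | [@#]\w+              # usernames and hashtags
    | [:;=][-']?[)(DPp]    # a few common emoticons
    | \w+(?:'\w+)?         # words, optionally with an apostrophe
    | [^\w\s]              # any other single non-space character
    """,
    re.VERBOSE,
)


def simple_tokenize(text):
    """Split text into tokens using the simplified pattern above."""
    return TOKEN_PATTERN.findall(text)


text = "@bob loved it!!! :) http://example.com"

# Whitespace splitting leaves the punctuation glued to the word.
whitespace_tokens = text.split()
print(whitespace_tokens)

# The regex tokenizer separates words, punctuation, emoticons, and links.
regex_tokens = simple_tokenize(text)
print(regex_tokens)
```

With whitespace splitting, `it!!!` would be a different token from `it`, so later steps such as stopword filtering would miss it.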

### Step 5: Remove noise from data
- The following tokens do not add value to our analysis
    - Twitter usernames (starting with @)
    - Hyperlinks (starting with http:// or https://)
    - Punctuation and special characters
        - HINT: if word in **string.punctuation**
    - Numeric values only
        - HINT: use **.isnumeric()**
    - If word is a stopword ([wiki](https://en.wikipedia.org/wiki/Stop_word))
        - HINT: Check if lower case word is in **stopwords.words('english')**
- To simplify, create a helper function **is_clean** to check for the above
- Create another helper function **clean_tokens**
    - The function takes **tokens** (a list of tokens) as input
    - Then returns a list of tokens, where **is_clean** has been used to filter
    - Also, let's lowercase it all
        - HINT: Use **lower()**
- Finally, use list comprehension on the lists of positive and negative tweets where **clean_tokens** is applied on each element (tokens).

In [17]:
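# Build the stopword set once; calling stopwords.words('english') per token is slow.
STOP_WORDS = set(stopwords.words('english'))


def is_clean(word: str) -> bool:
    """Return False for tokens that should be filtered out as noise."""
    # Twitter usernames
    if word.startswith('@'):
        return False
    # Hyperlinks
    if word.startswith(('http://', 'https://')):
        return False
    # Punctuation (a substring test, so emoticons such as ':)' are kept)
    if word in string.punctuation:
        return False
    # Purely numeric tokens
    if word.isnumeric():
        return False
    # Stopwords. The check is case-sensitive, so capitalized forms such as
    # 'Too' pass through; use word.lower() here to follow the hint above.
    if word in STOP_WORDS:
        return False
    return True


def clean_tokens(tokens: list) -> list:
    """Keep only the clean tokens and lowercase them."""
    return [word.lower() for word in tokens if is_clean(word)]


positive_tweets_cleaned = [clean_tokens(tokens) for tokens in positive_tweets]
negative_tweets_cleaned = [clean_tokens(tokens) for tokens in negative_tweets]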

In [18]:
positive_tweets_cleaned[0]

['#followfriday', 'top', 'engaged', 'members', 'community', 'week', ':)']

In [19]:
negative_tweets_cleaned[0]

['hopeless', 'tmr', ':(']
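
One detail of the punctuation hint is worth a closer look: `word in string.punctuation` is a substring test on a single string, not a per-character test. The sketch below compares it with a stricter variant; the function names are hypothetical and chosen only for this illustration.

```python
import string


def is_punctuation_substring(word):
    """Same test as the hint: is word a contiguous piece of string.punctuation?"""
    return word in string.punctuation


def is_all_punctuation(word):
    """Hypothetical stricter variant: every character is a punctuation character."""
    return bool(word) and all(ch in string.punctuation for ch in word)


samples = ['!', ':)', ':(', '...', '()', 'fun']
results = {s: (is_punctuation_substring(s), is_all_punctuation(s)) for s in samples}

for sample, (substring, all_punct) in results.items():
    print(f"{sample!r:8} substring={substring!s:5} all_punctuation={all_punct}")
```

The substring test lets emoticons such as `:)` and `:(` through, which matters here because they carry sentiment. The stricter variant would remove them along with tokens like `...`.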

### Step 6: Normalize the data
- The process of converting a word to its canonical form.
- Without normalization, “ran”, “runs”, and “running” would be treated as different words.
- Create a lemmatizer of **WordNetLemmatizer()**
    - HINT: use **lemmatizer = WordNetLemmatizer()**
- Create a helper function to lemmatize
    - HINT: Create a helper function **lemmatize(word, tag)**
        - Convert tag to **n** or **v** if tag starts with **NN** or **VB**, else **a**
        - Return **lemmatizer.lemmatize(word, tag)**
- Create a helper function **lemmatize_tokens(tokens: list)**
    - Return a list containing **lemmatize(word, tag)** for each **word, tag** pair in **pos_tag(tokens)**.
- Use list comprehension to normalize the positive and negative tweets
    - HINT: apply **lemmatize_tokens(...)** on all elements

In [20]:
lemmatizer = WordNetLemmatizer()


def lemmatize(word: str, tag: str):
    if tag.startswith('NN'):
        pos = 'n'
    elif tag.startswith('VB'):
        pos = 'v'
    else:
        pos = 'a'
    return lemmatizer.lemmatize(word, pos)


def lemmatize_tokens(tokens: list):
    return [lemmatize(word, tag) for word, tag in pos_tag(tokens)]


positive_tweets_normalized = [lemmatize_tokens(tokens) for tokens in positive_tweets_cleaned]
negative_tweets_normalized = [lemmatize_tokens(tokens) for tokens in negative_tweets_cleaned]

In [21]:
positive_tweets_normalized[0]

['#followfriday', 'top', 'engaged', 'member', 'community', 'week', ':)']

In [22]:
negative_tweets_normalized[0]

['hopeless', 'tmr', ':(']
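
The tag conversion in this step can be looked at in isolation. The sketch below repeats only the mapping rule from the hints, using the hypothetical name `to_wordnet_pos`, and applies it to a few Penn Treebank style tags of the kind **pos_tag** returns. It needs no NLTK data, since it does not call the lemmatizer.

```python
def to_wordnet_pos(tag):
    """Map a part-of-speech tag to the code passed to the lemmatizer in this step."""
    if tag.startswith('NN'):
        return 'n'  # noun tags: NN, NNS, NNP, ...
    if tag.startswith('VB'):
        return 'v'  # verb tags: VB, VBD, VBG, ...
    return 'a'      # everything else is treated as an adjective


tags = ['NN', 'NNS', 'VBD', 'VBG', 'JJ', 'RB']
mapping = {tag: to_wordnet_pos(tag) for tag in tags}

for tag, pos in mapping.items():
    print(tag, '->', pos)
```

Under this rule an adverb tag such as `RB` also maps to `a`. That keeps the helper short, at the cost of treating adverbs as adjectives.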

### Step 7: Prepare data for Model
- Example of normalized tweet: **['hopeless', 'tmr', ':(']**
    - Should become **({'hopeless': True, 'tmr': True, ':(': True}, 'Negative')**
- Hence, each tweet in the positive and negative lists should be converted to a **(features, label)** pair of this form
- HINT: use a dict comprehension inside a list comprehension

In [23]:
positive_dataset = [({token: True for token in tokens}, 'Positive') for tokens in positive_tweets_normalized]
negative_dataset = [({token: True for token in tokens}, 'Negative') for tokens in negative_tweets_normalized]

In [24]:
positive_dataset[0]

({'#followfriday': True,
  'top': True,
  'engaged': True,
  'member': True,
  'community': True,
  'week': True,
  ':)': True},
 'Positive')

In [25]:
negative_dataset[0]

({'hopeless': True, 'tmr': True, ':(': True}, 'Negative')
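
The same conversion can be wrapped in small helpers, which makes one property of this format easy to see: a dictionary records only whether a token is present, so repeated tokens collapse into a single feature. The names `to_features` and `to_dataset` and the sample tweets are hypothetical.

```python
def to_features(tokens):
    """Turn a list of tokens into a {token: True} feature dictionary."""
    return {token: True for token in tokens}


def to_dataset(tweets, label):
    """Pair the feature dictionary of every tweet with the same label."""
    return [(to_features(tokens), label) for tokens in tweets]


# 'so' appears twice but becomes a single feature.
repeated = to_features(['so', 'so', 'happy', ':)'])
print(repeated)

sample_dataset = to_dataset([['hopeless', 'tmr', ':(']], 'Negative')
print(sample_dataset[0])
```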

### Step 8: Prepare training and test dataset
- Build the dataset by combining the positive and negative datasets
- Shuffle the dataset
    - Use **shuffle**
- Let the training dataset be the first 7000 entries
- Let the test dataset be the remaining entries

In [26]:
dataset = positive_dataset + negative_dataset

shuffle(dataset)

train_ds = dataset[:7000]
test_ds = dataset[7000:]
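
Because **shuffle** is random, every run produces a different split and a slightly different accuracy. If repeatable results are wanted, one option is to shuffle with a seeded generator, as in the sketch below. The helper name `shuffle_and_split`, the seed, and the toy dataset are assumptions made for this illustration.

```python
import random
from collections import Counter


def shuffle_and_split(dataset, train_size, seed=42):
    """Shuffle a copy of the dataset with a fixed seed, then split it."""
    rng = random.Random(seed)   # private generator, leaves the global state alone
    shuffled = list(dataset)    # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


# Toy stand-in for the real dataset: 50 positive and 50 negative entries.
toy_dataset = (
    [({'id': i}, 'Positive') for i in range(50)]
    + [({'id': i}, 'Negative') for i in range(50, 100)]
)

train, test = shuffle_and_split(toy_dataset, train_size=70)
print(len(train), len(test))

# Check that both labels are represented in the training part.
label_counts = Counter(label for _, label in train)
print(label_counts)
```

Checking the label counts after the split is a cheap way to confirm that the training part is not dominated by one class.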

### Step 9: Train and test Model
- Train the model:
    - HINT: **classifier = NaiveBayesClassifier.train(train_data)**
- Test the accuracy
    - HINT: **classify.accuracy(classifier, test_data)**

In [27]:
classifier = NaiveBayesClassifier.train(train_ds)

In [28]:
classify.accuracy(classifier, test_ds)

0.9963333333333333
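
To give a feel for what happens during training, here is a minimal pure-Python sketch of a Naive Bayes classifier over the same **(features, label)** format. It is not NLTK's implementation: the function names, the add-one smoothing, and the choice to skip unseen features are simplifying assumptions, and NLTK differs in details.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_featuresets):
    """Count how often each label and each (label, feature) pair occurs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # label -> feature -> number of tweets
    vocabulary = set()
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name in features:
            feature_counts[label][name] += 1
            vocabulary.add(name)
    return label_counts, feature_counts, vocabulary


def predict_label(model, features):
    """Return the label with the highest log-probability score."""
    label_counts, feature_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior
        for name in features:
            if name not in vocabulary:
                continue  # features never seen in training are skipped
            # Add-one smoothed estimate of P(feature present | label)
            probability = (feature_counts[label][name] + 1) / (count + 2)
            score += math.log(probability)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


toy_train = [
    ({'fun': True, ':)': True}, 'Positive'),
    ({'awesome': True, ':)': True}, 'Positive'),
    ({'sad': True, ':(': True}, 'Negative'),
    ({'hopeless': True, ':(': True}, 'Negative'),
]
model = train_naive_bayes(toy_train)

print(predict_label(model, {'fun': True, 'awesome': True}))
print(predict_label(model, {'sad': True}))
```

The "naive" part is the loop over features: each feature contributes its own factor, as if the tokens in a tweet were independent of each other given the label.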

### Step 10: Show the most informative features
- HINT: Get the 10 most informative features: **classifier.show_most_informative_features(10)**

In [29]:
classifier.show_most_informative_features(10)

Most Informative Features
                      :) = True           Positi : Negati =    991.9 : 1.0
                     sad = True           Negati : Positi =     25.0 : 1.0
                     bam = True           Positi : Negati =     22.4 : 1.0
                follower = True           Positi : Negati =     19.8 : 1.0
                     too = True           Negati : Positi =     19.0 : 1.0
              appreciate = True           Positi : Negati =     17.0 : 1.0
                     x15 = True           Negati : Positi =     17.0 : 1.0
               community = True           Positi : Negati =     16.4 : 1.0
                    damn = True           Negati : Positi =     14.3 : 1.0
                  arrive = True           Positi : Negati =     12.2 : 1.0
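
Each row in this table can be read as how much more likely a feature is under one label than under the other. The sketch below computes such a ratio from raw counts. The counts, the function name `likelihood_ratio`, and the smoothing constant `alpha` are illustrative assumptions, not values taken from the classifier above.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b, alpha=0.5):
    """Ratio between the smoothed probabilities of a feature under two labels."""
    p_a = (count_a + alpha) / (total_a + 2 * alpha)
    p_b = (count_b + alpha) / (total_b + 2 * alpha)
    # Report the larger probability relative to the smaller one.
    return max(p_a, p_b) / min(p_a, p_b)


# Hypothetical counts: the feature occurs in 10 of 100 tweets with label A
# and in 1 of 100 tweets with label B.
ratio = likelihood_ratio(10, 100, 1, 100)
print(f"{ratio:.1f} : 1.0")  # prints 7.0 : 1.0
```

Seen this way, a ratio close to 1000 for `:)` means that single token is almost enough to decide the label, which likely explains much of the high accuracy in Step 9.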


### Step 11: Test the model
- Try your model as follows:
    - Define a tweet: **tweet = 'this is fun and awesome'**
    - Prepare data for model: **tweet_dict = {token: True for token in lemmatize_tokens(clean_tokens(tweet.split()))}**
    - Classify data: **classifier.classify(tweet_dict)**

In [30]:
tweet = 'this is fun and awesome'
tweet_dict = {token: True for token in lemmatize_tokens(clean_tokens(tweet.split()))}

In [31]:
classifier.classify(tweet_dict)

'Positive'

### Bonus: The pre-trained Sentiment Intensity Analyzer
- VADER (Valence Aware Dictionary and sEntiment Reasoner) ([Vader](https://www.nltk.org/howto/sentiment.html))

In [32]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/adel/nltk_data...


True

In [33]:
sia = SentimentIntensityAnalyzer()

In [34]:
sia.polarity_scores('this is fun and awesome')

{'neg': 0.0, 'neu': 0.288, 'pos': 0.712, 'compound': 0.8126}
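
To compare VADER with the classifier above, the score dictionary can be reduced to a single label. The sketch below uses the `compound` value with a cutoff of ±0.05, a commonly used convention for VADER. Both the cutoff and the helper name `label_from_compound` are assumptions here and can be adjusted.

```python
def label_from_compound(scores, threshold=0.05):
    """Turn a VADER-style score dictionary into a coarse sentiment label."""
    compound = scores['compound']
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'


# Scores copied from the output above.
scores = {'neg': 0.0, 'neu': 0.288, 'pos': 0.712, 'compound': 0.8126}
label = label_from_compound(scores)
print(label)
```

Unlike the Naive Bayes model, this yields a third, neutral label for scores near zero.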