copied from:

https://github.com/ZacLanghorne/FBSentimentAnalysis/blob/master/FacebookSentimentAnalysis.ipynb

steps:

1. _tokenising_

transforms the full sentences into words by removing stop words, conjunctions, etc

2. _normalising and noise reduction_

this is done by lemmatising the tweets; meaning it takes the words to their root. For example, you have are, am, is --> to be or cars, car's, car --> car. 

- determine word density 
- build model
- explorative analysis

# Imports


In [1]:
import nltk
import random
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
import collections
from nltk.corpus import twitter_samples # pre scraped tweets
from nltk.corpus import stopwords 
stop_words = stopwords.words('english') # enlgish tokenisation
from nltk import FreqDist # stats 
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tag import pos_tag

In [3]:
# remove noise
import re
import string

In [4]:
from nltk import classify
from nltk import NaiveBayesClassifier
import matplotlib.pyplot as plt
import sys
import requests

In [5]:
# for tokenisation
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sherv\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
# toeknising 
nltk.download('punkt')
# normalising and lemmatising text
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sherv\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sherv\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\sherv\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [7]:
# get twitter samples 
# scraping twitter is only allowed under certain conditions
# we train with sample data and apply to scraped data
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\sherv\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

# Tokenising

### what is twitter samples
*fileids*
- to know what the strings method takes

*strings*
- gets tweets in json format

*tokenized*
- you get tokenised tweets but we want to tokenise it ourselves to practice

In [16]:
twitter_samples.fileids()

['negative_tweets.json', 'positive_tweets.json', 'tweets.20150430-223406.json']

In [17]:
positive_tweets = twitter_samples.strings("positive_tweets.json")
negative_tweets = twitter_samples.strings("negative_tweets.json")
text = twitter_samples.strings('tweets.20150430-223406.json')

In [18]:
# Tokenise the tweets
positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

# Normalising and removing noise

### what is re (regular expression operations)

1. A regular expression (or RE) specifies a set of strings that matches it. The .sub() method substitutes it with the given string.

2. re contains special characters as well so if we want to search for certain characters we need to escape them with `\`. These characters are:
    - `.` - searches for any character except new line
    - `\d` - digits (0-9)
    - `\D` - not a digit (0-9)
    - `w` - word character (a-z, A-Z, 0-9, _)
    - `W` - not a word character
    - `s` - white space
    - `S` - not a white space 
    <br>
    <br>
    Anchors (invisible positions before or after characters):
    - `b` - word boundary ("\bHa", "Ha HaHa") would return "**Ha** **Ha**Ha"
    - `B` - Not a word boundary ("\bHa", "Ha HaHa") would return "Ha Ha**Ha**"
    - `^` - Beginning of a string ("^http", "http://www.blabla/http") would return "**http**://www.blabla/http"
    - `DollarSign` - end of string ("http`DollarSign`", "http://www.blabla/http") would return "http://www.blabla/**http**"
    <br>
    <br>
    - `[]` - searches for whatever is in the bracket eg: ("\d[-.]\d", "1.1, 2-3, 2/2") would return "**1.1**, **2-3**, 2/2" <br>
    &nbsp; `[1-5]` creates a range and will choose ONE character in that range and list <br>
    &nbsp; `[^a-z]` using `^` within a character set searches for everything except what's included in the set eg ("[^b]at",  "cat, pat, bat") returns "**cat**, **pat**, bat"
    - `|` - either or
    - `()` - group
    <br>
    <br>
    Quantifiers:
    - `*` - 0 or more
    - `+` - 1 or more
    - `?` - 0 or one
    - `{3}` - exact number eg: ("\d{3}", "123, 1234, 12, 1") will return  "**123**, 1234, 12, 1"
    - `{3,4}` - range of numbers
    
When you create groups, you can select them from the returned object by the re. Group 0 is always the entire match returned. eg: <br>
```python
url = "https://www.google.com, https://blabla.com"
pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)") # group 1: www. group 2: subdomain group 3:.domain
matches = pattern.finditer(url) # or use pattern.findall(url)
for match in matches:
    print(match.group(1))
```

you can also refer to these groups when substituting. eg: <br>
```python
subbed_url = pattern.sub(r"\2\3", url) # subs groups 2 and 3
```

Finally, you can ignore case using a flag. eg <br>
```python
pattern = re.compile(r"start", re.IGNORECASE) # shorthand flag is re.I
matches = pattern.findall(url)
print(matches)
```

In [19]:
def remove_noise(tweet_tokens, stop_words = ()):
    cleaned_tokens = []
    
    for token, tag in pos_tag(tweet_tokens):
        token = re.sub(r"^https?:\/\/.*[\r\n]*","", token) # removes https://<0 or more characters><0 or more new lines>
        token = re.sub(r"(@[A-Za-z0-9_]+)","", token) # removes mentions from the tweet
        
        if tag.startswith("NN"):
            pos = "n" # assign names as nouns
        elif tag.startswith("VB"):
            pos = "v" # assign verbs as verbs
        else:
            pos = "a"
            
        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token,pos) # reduce words to their root words
        
        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            # gets rid of the empty tweets and any punctuation
            cleaned_tokens.append(token.lower())
        
        return cleaned_tokens

In [20]:
positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

In [21]:
for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words)) 
for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

# Word Density

In [22]:
# returns an iterator for all the words in the dictionary
def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

In [23]:
all_pos_words = get_all_words(positive_cleaned_tokens_list)
all_neg_words = get_all_words(negative_cleaned_tokens_list)

In [24]:
# Make a frequency distribution to find most common words.

freq_dist_pos = FreqDist(all_pos_words)
freq_dist_neg = FreqDist(all_neg_words)

In [25]:
# Convert the values to a dictionary for us in naive bayes classification.

def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

In [26]:
# Make the lists of model ready data.

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

In [27]:
# Split the data in to train and test data.

positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model] # Label the positive data.

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model] # Label the negative data.

dataset = positive_dataset + negative_dataset # Combine the data sets.

random.shuffle(dataset) # Shuffle the data so theres no natural ordering.

train_data = dataset[:int(len(dataset)*0.7)] # First 7000 entries for train.
test_data = dataset[int(len(dataset)*0.7):] # Final 3000 for testing.

# test model

In [28]:
classifier = NaiveBayesClassifier.train(train_data) # must creaate dictionary

In [29]:
# Return the accuracy and the words that are most useful in determing the sentiment.
print("Accuracy is:", classify.accuracy(classifier, test_data))

print(classifier.show_most_informative_features(10))

Accuracy is: 0.5376666666666666
Most Informative Features
                   happy = True           Positi : Negati =     19.2 : 1.0
                      hi = True           Positi : Negati =     15.8 : 1.0
                   hello = True           Positi : Negati =     10.0 : 1.0
                  really = True           Negati : Positi =      7.2 : 1.0
                       “ = True           Negati : Positi =      6.5 : 1.0
                     get = True           Positi : Negati =      4.9 : 1.0
                     hey = True           Positi : Negati =      4.9 : 1.0
                  follow = True           Positi : Negati =      4.9 : 1.0
                    cant = True           Negati : Positi =      3.7 : 1.0
                   sorry = True           Negati : Positi =      3.5 : 1.0
None


In [30]:
# Test out the model with custom tweets!
custom_tweet = "Test message"
custom_tokens = remove_noise(nltk.word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

Positive


# Facebook Scraping

In [None]:
# Initialise the file to scrape.
# url = 'D:\User\*File_Location*' + str(1) + '.html'
# soup = BeautifulSoup(open(url, encoding = 'utf8').read(), 'html.parser')

In [8]:
# fb = requests.get("https://facebook.com")
# TODO: create a session and log in and scrape the live feed

In [31]:
fb = open("facebook.html",encoding='utf8').read()

In [33]:
soup = BeautifulSoup(fb,"html.parser")

In [35]:
# Scrape the messages.
Texts = [] # Initialise the Texts vector.
j = 0
for div in soup.find_all('div', class_ = '_3-96 _2let'): ## Returns the message.
    Texts.append(div.text) ## Saves output to Texts
    j += 1

In [37]:
Texter = []
for div in soup.find_all('div', class_ = '_3-96 _2pio _2lek _2lel'): # Returns the person that sent the text
        Texter.append(div.text) ## Saves output to Texter

In [38]:
mess_date = []
for div in soup.find_all('div', class_ = '_3-94 _2lem'): # Returns the person that sent the text
    mess_date.append(div.text) ## Saves output to Texter

### Apply model to facebook

In [39]:
text_tokens = []

for msg in Texts:
    text_tokens.append(nltk.word_tokenize(msg)) # Tokenize the messages.

txt_cleaned_tokens_list = []

for tokens in text_tokens:
    txt_cleaned_tokens_list.append(remove_noise(tokens, stop_words)) ## Clean the messages.

texts_for_analysis = get_tweets_for_model(Texts)

msg_clas = []
for txts in texts_for_analysis:
    msg_clas.append(classifier.classify(dict([token, True] for token in txts))) ## Run the messages through a calssifier.

In [40]:
FullTexts = list(zip(Texter,Texts, mess_date, msg_clas))
Mess_df = pd.DataFrame(FullTexts, columns = ['Texter','Texts', 'mess_date', 'msg_clas']) # Add the sentiment to a dataframe.

In [41]:
# Determine the number of positive and negative messages.

num_pos_msg = 0
for msg in msg_clas:
    if msg == 'Positive':
        num_pos_msg += 1

num_neg_msg = 0
for msg in msg_clas:
    if msg == 'Negative':
        num_neg_msg += 1

In [42]:
num_pos_msg/(num_pos_msg + num_neg_msg)*100 # Determine percentage of positive messages.

ZeroDivisionError: division by zero