# Sentiment Analysis - Classifier

This notebook builds my own sentiment classifier, trained on news article titles, which I can then use in one of my projects.

Process:

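1. Get the training dataset (loaded from MongoDB below; the CSV load is left commented out)
2. Get stopwords from NLTK. Edit the list to suit news articles, keeping negations and direction words such as 'not', 'up' and 'down'.
3. Lemmatize the remaining words, keeping only words of at least 3 letters that are not stopwords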
4. Attach sentiment to the word list
5. For each word, store how many times it was present in positive articles and how many times in negative.
6. Accordingly, train classifier using this data
7. Get new articles and classify them
8. Classify the same articles with the NLTK SentimentIntensityAnalyzer (VADER)
9. Test accuracy by comparing the two sets of labels
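
Step 5 can be pictured with a small standard-library sketch. The toy titles and the name `count_words_by_sentiment` are illustrative assumptions, not part of the notebook; in the notebook itself the counting happens inside NLTK's Naive Bayes trainer.

```python
from collections import Counter, defaultdict

def count_words_by_sentiment(labeled_titles):
    """Return {sentiment: Counter(word -> number of titles containing it)}."""
    counts = defaultdict(Counter)
    for words, sentiment in labeled_titles:
        # set() so a word repeated within one title is counted once
        counts[sentiment].update(set(words))
    return counts

# Toy data in the same (word list, sentiment) shape used later in the notebook
toy_titles = [
    (['profit', 'rise', 'record'], 'positive'),
    (['loss', 'widen', 'risk'], 'negative'),
    (['profit', 'warning', 'risk'], 'negative'),
]
counts = count_words_by_sentiment(toy_titles)
print(counts['negative']['risk'], counts['positive']['profit'])
```

A word such as 'risk' that shows up mostly under one label is the kind of evidence the classifier later leans on.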


In [1]:
import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from afinn import Afinn
from newsapi import NewsApiClient
from nltk.stem import WordNetLemmatizer
import json
from pymongo import MongoClient




In [2]:
sid = SentimentIntensityAnalyzer()
afinn = Afinn()
newsapi = NewsApiClient(api_key='c739ce625cc44a2489a36795b6fbcf7e')
lemmatizer = WordNetLemmatizer()
client = MongoClient('mongodb://kodigo_smu:kodigo123@ds139331.mlab.com:39331/kodigo_smu')
db = client['kodigo_smu']
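
The API key and database URI above are written directly into the cell. One common alternative is to read them from environment variables; the sketch below assumes hypothetical variable names (`NEWSAPI_KEY`, `MONGODB_URI`) and a hypothetical helper `read_secret`, neither of which exists in the original notebook.

```python
import os

def read_secret(name, env=os.environ):
    """Return the value of an environment variable, or raise a clear error if unset."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook")
    return value

# Hypothetical variable names; pick whatever fits your setup.
# api_key = read_secret('NEWSAPI_KEY')
# mongo_uri = read_secret('MONGODB_URI')
```

The returned values could then be passed to `NewsApiClient(api_key=...)` and `MongoClient(...)` in place of the literals.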

In [3]:
news_articles_train = db['news_articles_train']

In [4]:
all_articles = pd.DataFrame(list(news_articles_train.find({})))

In [5]:
all_articles.head()

Unnamed: 0,_id,author,content,description,publishedAt,sentiment,source,title,url,urlToImage
0,5bb4df58fd6d92a4a6d9a286,Rhett Jones,"“We’re a culture of builders,” Janelle Gale, F...","“We’re a culture of builders,” Janelle Gale, F...",2018-09-04T23:50:00Z,neutral,"{'id': None, 'name': 'Gizmodo.com'}",Facebook's New Office Looks Like a Tree House ...,https://gizmodo.com/facebooks-new-office-looks...,https://i.kinja-img.com/gawker-media/image/upl...
1,5bb4df58fd6d92a4a6d9a287,ANEMONA HARTOCOLLIS and DANA GOLDSTEIN,"At St. Pauls, an Episcopal boarding school in ...",The allegations against Judge Brett Kavanaugh ...,2018-09-28T21:23:42Z,neutral,"{'id': 'the-new-york-times', 'name': 'The New ...",Schools Are Tackling ‘Bro’ Culture. The Kavana...,https://www.nytimes.com/2018/09/28/us/kavanaug...,https://static01.nyt.com/images/2018/09/27/us/...
2,5bb4df58fd6d92a4a6d9a288,Saqib Shah,A total of 50 episodes have been commissioned ...,Following in the footsteps of Zane Lowe and Ry...,2018-09-21T12:20:00Z,neutral,"{'id': 'engadget', 'name': 'Engadget'}",Spotify taps DJ Semtex for hip-hop culture pod...,https://www.engadget.com/2018/09/21/spotify-dj...,https://o.aolcdn.com/images/dims?thumbnail=120...
3,5bb4df58fd6d92a4a6d9a289,Brian Geddes,"September 7, 2018 5 min read Opinions expresse...",For generations marijuana was a cheerfully out...,2018-09-07T12:00:00Z,neutral,"{'id': None, 'name': 'Greenentrepreneur.com'}",Cannabis Culture Is Fast Becoming Corporate Cu...,https://www.greenentrepreneur.com/article/319598,https://assets.entrepreneur.com/content/3x2/20...
4,5bb4df58fd6d92a4a6d9a28a,Xavier Piedra,"I don't mean to alarm anyone, but it's totally...","I don't mean to alarm anyone, but it's totally...",2018-09-19T15:25:05Z,negative,"{'id': 'mashable', 'name': 'Mashable'}",This weirdly peppy anthem for end times will b...,https://mashable.com/video/bill-wurtz-mount-st...,https://i.amz.mshcdn.com/1a44EASiuhIu8huDbfDbw...


In [6]:
# all_articles_csv = pd.read_csv('data/all_articles_with_sentiment.csv')
# all_articles_csv.head()

## Prepare Training Data

In [7]:
articles_list = list(zip(all_articles['title'], all_articles['sentiment']))

In [8]:
# Build the stopword list, keeping negations and direction words that can carry sentiment
stopWords = list(stopwords.words('english'))

useful_words = ['but', 'because', 'up', 'down', 'under', 'not', 'only', 'aren', "aren't", 'couldn', "couldn't", 'didn',
                  "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't",
                 "mightn't", "mustn't","needn't", "shan't", "shouldn't", "wasn't",  "weren't", "won't", "wouldn't"]
stopWords = [x for x in stopWords if x not in useful_words]

In [9]:
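article_word_list = []
for (words, sentiment) in articles_list:
    # Keep tokens of 3+ characters that are not stopwords, then lemmatize and lowercase.
    # Note: the stopword check and the lemmatizer both see the original casing, so
    # capitalized tokens such as 'The' or 'Looks' pass the filter and usually stay unlemmatized.
    words_filtered = [lemmatizer.lemmatize(e).lower() for e in words.split() if len(e) >= 3 and e not in stopWords]
    article_word_list.append((words_filtered, sentiment))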

In [10]:
article_word_list

[(["facebook's",
   'new',
   'office',
   'looks',
   'like',
   'tree',
   'house',
   'built',
   'incompetent',
   'dad'],
  'neutral'),
 (['schools',
   'are',
   'tackling',
   '‘bro’',
   'culture.',
   'the',
   'kavanaugh',
   'case',
   'shows',
   'why',
   'that’s',
   'hard',
   'do.'],
  'neutral'),
 (['spotify', 'tap', 'semtex', 'hip-hop', 'culture', 'podcast'], 'neutral'),
 (['cannabis', 'culture', 'fast', 'becoming', 'corporate', 'culture'],
  'neutral'),
 (['this',
   'weirdly',
   'peppy',
   'anthem',
   'end',
   'time',
   'stuck',
   'head',
   'day'],
  'negative'),
 (['‘murphy', 'brown’', 'returns', 'fight', 'new', 'culture', 'wars'],
  'negative'),
 (['these', 'high', 'school', 'senior', 'took', 'photo', 'next', 'level'],
  'neutral'),
 (['please', 'enjoy', "grandmother's", 'perfect', 'bottle', 'flip'],
  'positive'),
 (['‘the', 'simpsons’', '30:', 'six', 'era-defining', 'episodes'], 'neutral'),
 (['what', 'make', 'apple’s', 'design', 'culture', 'special'], 'p
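
The output above still contains stopwords ('the', 'are', 'this') and tokens with attached punctuation ('culture.', '‘bro’'), because filtering runs on the original casing and nothing strips punctuation. A minimal sketch of one way to normalize first is shown below. The function name `normalize_title`, the toy stopword set and the identity default for `lemmatize` are assumptions for illustration; in the notebook, `stopWords` and `lemmatizer.lemmatize` would be passed in instead.

```python
import string

def normalize_title(title, stop_words, lemmatize=lambda w: w, min_len=3):
    """Lowercase and strip surrounding punctuation first, then filter and lemmatize."""
    strip_chars = string.punctuation + '‘’“”'
    tokens = []
    for raw in title.split():
        word = raw.lower().strip(strip_chars)
        if len(word) >= min_len and word not in stop_words:
            tokens.append(lemmatize(word))
    return tokens

toy_stop_words = {'the', 'are', 'why', 'that', 'this'}  # stand-in for the edited NLTK list
print(normalize_title('Schools Are Tackling ‘Bro’ Culture. The Kavanaugh Case', toy_stop_words))
```

If something like this were adopted, the same function would also need to be applied to titles at prediction time so that the words line up with the training vocabulary.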

## Build Classifier

In [11]:
def get_word_features(wordlist):
    # FreqDist counts each word; its keys are the distinct words, i.e. the vocabulary
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

In [12]:
def get_words_in_articles(articles):
    all_words = []
    for (words, sentiment) in articles:
        all_words.extend(words)
    return all_words

In [13]:
word_features = get_word_features(get_words_in_articles(article_word_list))

In [14]:
def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
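
To make the feature format concrete, here is a small sketch of the same idea with the vocabulary passed in explicitly. `extract_features_for` and the three-word toy vocabulary are assumptions for illustration; the notebook's own function reads the global `word_features`.

```python
def extract_features_for(document, vocabulary):
    """Same idea as extract_features, with the vocabulary passed in explicitly."""
    document_words = set(document)
    return {'contains(%s)' % word: (word in document_words) for word in vocabulary}

toy_vocabulary = ['loss', 'profit', 'risk']  # stand-in for word_features
features = extract_features_for(['profit', 'jumps'], toy_vocabulary)
print(features)
# {'contains(loss)': False, 'contains(profit)': True, 'contains(risk)': False}
```

Every title becomes a dictionary with one boolean per vocabulary word, and words outside the vocabulary ('jumps' here) are ignored.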

In [15]:
training_set = nltk.classify.apply_features(extract_features, article_word_list)

In [16]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [17]:
# Print the 10 most informative features
classifier.show_most_informative_features(10)

Most Informative Features
          contains(loss) = True           negati : neutra =     13.3 : 1.0
         contains(great) = True           positi : neutra =     13.2 : 1.0
          contains(made) = True           positi : neutra =     12.2 : 1.0
          contains(risk) = True           negati : neutra =     11.3 : 1.0
        contains(shares) = True           positi : neutra =     11.3 : 1.0
           contains(win) = True           positi : neutra =     11.3 : 1.0
        contains(profit) = True           positi : neutra =      8.3 : 1.0
           contains(top) = True           positi : neutra =      8.3 : 1.0
         contains(death) = True           negati : neutra =      8.0 : 1.0
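
Each row compares how likely a feature value is under two labels (label names appear truncated to six characters), so `13.3 : 1.0` reads as "about 13 times more likely under the first label". The sketch below shows how such a ratio can arise from counts. It assumes add-0.5 smoothing over two possible feature values, which is what NLTK's default estimator is understood to use; the counts and function names are hypothetical.

```python
def smoothed_prob(count, total, gamma=0.5, bins=2):
    """Additive-smoothing estimate of P(feature value | label)."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the feature value is under label a than under label b."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Hypothetical counts: the word appears in 6 of 50 negative titles and 1 of 200 neutral titles
ratio = likelihood_ratio(6, 50, 1, 200)
print(round(ratio, 1))
```

Large ratios built on very small counts are fragile, which is worth keeping in mind when reading the table.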


## Classify

In [18]:
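def classify(words):
    # Note: the title is only whitespace-split here. It is not lowercased, lemmatized or
    # stopword-filtered like the training titles, so some words will not match word_features.
    return classifier.classify(extract_features(words.split()))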

In [19]:
def classify_sia(words):
    ss = sid.polarity_scores(words)
    polarity = ss['compound']
    return 'positive' if polarity>0 else 'negative' if polarity<0 else 'neutral'
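
`classify_sia` labels any non-zero compound score as positive or negative. The VADER documentation suggests a neutral band instead, with thresholds at ±0.05. A minimal sketch of that mapping follows; `label_from_compound` is a hypothetical helper that takes the compound score directly, so it runs without NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a label, with a neutral band around zero."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

for score in (0.6, 0.01, -0.3):
    print(score, label_from_compound(score))
```

In the notebook this would be called as `label_from_compound(sid.polarity_scores(words)['compound'])`. Widening the neutral band would change how often the two classifiers agree.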

## Testing Accuracy

In [20]:
news_results = newsapi.get_everything(q='sustainability', page_size=100)
test_articles = pd.DataFrame(news_results['articles'])

In [21]:
test_articles['sentiment_classifier'] = test_articles['title'].apply(classify)

In [22]:
test_articles['sentiment_sia'] = test_articles['title'].apply(classify_sia)

In [23]:
test_articles.head()

Unnamed: 0,author,content,description,publishedAt,source,title,url,urlToImage,sentiment_classifier,sentiment_sia
0,JAMES BARRON,She said her awareness of environmental concer...,"Sarah Kauss, the founder of S’well, and Mark C...",2018-09-23T14:00:04Z,"{'id': 'the-new-york-times', 'name': 'The New ...","Grace Notes: From a 1993 Vow to 320,000 Reusab...",https://www.nytimes.com/2018/09/23/nyregion/sw...,https://static01.nyt.com/images/2018/09/24/nyr...,neutral,positive
1,Chiara Cecchini,From Singapore to the USA and all around Europ...,"From green tea to kombucha, the humble tea has...",2018-09-06T13:00:00Z,"{'id': None, 'name': 'Makezine.com'}","Edible Innovation: All for Bamboo, All for Sus...",https://makezine.com/2018/09/06/edible-innovat...,https://i2.wp.com/makezine.com/wp-content/uplo...,neutral,positive
2,Brian Heater,It’s been a rough few months for MoviePass. Bu...,It’s been a rough few months for MoviePass. Bu...,2018-09-17T14:33:20Z,"{'id': 'techcrunch', 'name': 'TechCrunch'}",MoviePass competitor Sinemia launches unlimite...,http://techcrunch.com/2018/09/17/moviepass-com...,https://techcrunch.com/wp-content/uploads/2018...,neutral,neutral
3,Seamus Bellamy,Apple has always talked a good game where recy...,Apple has always talked a good game where recy...,2018-09-14T22:42:52Z,"{'id': None, 'name': 'Boingboing.net'}",Apple's claims about recycling and sustainabil...,https://boingboing.net/2018/09/14/apples-claim...,https://media.boingboing.net/wp-content/upload...,negative,negative
4,Megan Rose Dickey,Lyft is on a bit of a Tesla poaching spree whi...,Lyft is on a bit of a Tesla poaching spree whi...,2018-09-14T15:00:17Z,"{'id': 'techcrunch', 'name': 'TechCrunch'}",Lyft hires yet another ex-Tesla employee,http://techcrunch.com/2018/09/14/lyft-hires-ye...,https://techcrunch.com/wp-content/uploads/2018...,neutral,neutral


In [24]:
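test_articles_copy = test_articles.copy()
# 'accuracy' here means agreement with the VADER label, which is itself a prediction, not ground truth
test_articles_copy['accuracy'] = test_articles_copy['sentiment_classifier']==test_articles_copy['sentiment_sia']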

In [25]:
accurate = test_articles_copy[test_articles_copy['accuracy']]

In [26]:
accurate.shape

(73, 11)

In [27]:
test_articles_copy.shape

(100, 11)
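
The two shapes above mean 73 of the 100 titles received the same label from both classifiers, i.e. 73% agreement. A breakdown by label pair shows where the disagreements fall. The sketch below uses only the standard library and the five rows displayed by `test_articles.head()`; `agreement_report` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def agreement_report(labels_a, labels_b):
    """Return (agreement rate, Counter of (label_a, label_b) pairs)."""
    pairs = list(zip(labels_a, labels_b))
    matches = sum(a == b for a, b in pairs)
    return matches / len(pairs), Counter(pairs)

# Labels taken from the five rows shown by test_articles.head()
classifier_labels = ['neutral', 'neutral', 'neutral', 'negative', 'neutral']
sia_labels = ['positive', 'positive', 'neutral', 'negative', 'neutral']
rate, pair_counts = agreement_report(classifier_labels, sia_labels)
print(rate)
print(pair_counts.most_common())
```

On the full data the two columns `sentiment_classifier` and `sentiment_sia` could be passed in directly. In these five rows, both disagreements are titles the custom classifier calls neutral and VADER calls positive.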

In [29]:
json_object = json.loads(test_articles.to_json(orient='records'))
final_json = {'articles': json_object, 
             'totalResults': news_results['totalResults']}
final_json

{'articles': [{'author': 'JAMES BARRON',
   'content': 'She said her awareness of environmental concerns began before the high-school water fountain and the college pledge. We were the first family on my street to have recycling bins, said Ms. Kauss, who grew up in South Florida. I have a fond memory of my dad fil… [+2460 chars]',
   'description': 'Sarah Kauss, the founder of S’well, and Mark Chambers, the director of the Mayor’s Office of Sustainability, at S’well in Manhattan last week. They are working to slow the stream of discarded plastic bottles that clog waterways, threatening marine life, and t…',
   'publishedAt': '2018-09-23T14:00:04Z',
   'sentiment_classifier': 'neutral',
   'sentiment_sia': 'positive',
   'source': {'id': 'the-new-york-times', 'name': 'The New York Times'},
   'title': 'Grace Notes: From a 1993 Vow to 320,000 Reusable Water Bottles for Every New York High Schooler',
   'url': 'https://www.nytimes.com/2018/09/23/nyregion/swell-water-bottles-nyc-high-schoo