<a href="https://colab.research.google.com/github/rahiakela/practical-natural-language-processing/blob/master/8-social-media/04_understanding_twitter_sentiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Understanding Twitter Sentiment

When it comes to NLP and social media, one of the most popular applications has to be sentiment analysis. For businesses and brands across the globe, it’s crucial to listen to what people are saying about them and their products and services. It’s even more important to know whether people’s opinion is positive or negative and if this sentiment polarity is changing over time. 

In the pre-social era, this was done using customer surveys, including door-to-door visits. In today’s world, social media is a great way to understand people’s sentiment about a brand and, more importantly, how that sentiment changes over time.

<img src='https://github.com/rahiakela/img-repo/blob/master/practical-nlp/8-10.png?raw=1' width='800'/>

In this notebook, we’ll focus on building a sentiment analysis system for Twitter data using a publicly available dataset.

How is sentiment analysis for Twitter different from sentiment analysis on other kinds of text? The key difference lies in the dataset. Many sentiment corpora consist of reasonably well-formed sentences, whereas the data in the Twitter sentiment corpus consists of tweets written informally. This informality leads to various issues, such as nonstandard spellings and vocabulary, and these issues, in turn, hurt the performance of the model.

We’ll move forward by building a system for sentiment analysis and setting up a baseline. For this, we’ll use TextBlob, which is a Python-based NLP toolkit built on top of NLTK and Pattern. It comes with an array of modules for text processing, text mining, and text analysis. All it takes is five lines of code to get a basic sentiment classifier.

## Setup

In [None]:
import pandas as pd
from textblob import TextBlob

In [None]:
!wget -q https://raw.githubusercontent.com/rahiakela/practical-natural-language-processing/master/8-social-media/data/sts_gold_tweet.csv

## Loading the dataset

In [None]:
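# error_bad_lines was removed in pandas 2.0; on_bad_lines="skip" (pandas >= 1.3) drops malformed rows
df = pd.read_csv("sts_gold_tweet.csv", on_bad_lines="skip", delimiter=";")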
df.head()

In [None]:
print(df.columns)

## Defining the baseline

In [None]:
# make a list of all the tweets
tweets_text_collection = list(df["tweet"])

In [None]:
for tweet_text in tweets_text_collection:
    print(tweet_text)
    analysis = TextBlob(tweet_text)
    # Analyse the sentiment of the tweet.
    # Polarity lies in [-1.0, 1.0] and tells how negative or positive the text is.
    # Subjectivity lies in [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.
    print(analysis.sentiment)
    print("-" * 20)
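
To serve as a baseline, the polarity scores have to be turned into labels and compared with the gold annotations. The following is a minimal sketch that is not part of the original notebook. The helper names `polarity_to_label`, `accuracy`, and `textblob_baseline` are hypothetical, and the threshold of 0.0 is an arbitrary choice. The example uses hand-written scores so that it runs without the dataset.

```python
def polarity_to_label(polarity, threshold=0.0):
    """Map a polarity score in [-1.0, 1.0] to a binary label (hypothetical helper)."""
    # Scores at or below the threshold, including a neutral 0.0, count as negative.
    return "positive" if polarity > threshold else "negative"


def accuracy(predicted, gold):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    matches = sum(p == g for p, g in zip(predicted, gold))
    return matches / len(gold)


def textblob_baseline(tweets):
    """Label each tweet using TextBlob's polarity score."""
    # Imported here so the helpers above can be used without TextBlob installed.
    from textblob import TextBlob
    return [polarity_to_label(TextBlob(text).sentiment.polarity) for text in tweets]


# Example with hand-written scores (no dataset or TextBlob required).
example_scores = [0.8, -0.4, 0.0, 0.25]
example_gold = ["positive", "negative", "negative", "negative"]
example_pred = [polarity_to_label(score) for score in example_scores]
print(accuracy(example_pred, example_gold))  # 0.75
```

On the notebook's data, `textblob_baseline(tweets_text_collection)` would produce the predictions. The gold labels would come from the dataset's label column; its name and encoding should be checked with `print(df.columns)` and `unique()` on that column before mapping the values to `"positive"` and `"negative"`.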

TextBlob’s sentiment analyzer uses a simple idea: tokenize the tweet and compute polarity and subjectivity for each of the tokens. Then combine the polarity and subjectivity numbers to arrive at a single value for the whole sentence. We leave it to the reader to get into the finer details.

This simple sentiment classifier might not work well, primarily because of the tokenizer used by TextBlob. Our data comes from social media, so it will most likely not follow formal English. Thus, after tokenization, many of the tokens may not be standard words found in an English dictionary, so we won’t have polarity and subjectivity values for those tokens.
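
The effect of out-of-dictionary tokens can be illustrated with a toy lexicon-based scorer. This is a sketch of the general idea only, not TextBlob's actual implementation: the lexicon `TOY_LEXICON`, its scores, and the function `toy_polarity` are invented for illustration.

```python
# Hypothetical mini-lexicon mapping tokens to polarity in [-1.0, 1.0]. Values are made up.
TOY_LEXICON = {"love": 0.5, "great": 0.8, "hate": -0.8, "bad": -0.7}


def toy_polarity(text):
    """Average the polarity of tokens found in the lexicon; return 0.0 if none are found."""
    tokens = text.lower().split()  # naive whitespace tokenizer
    scores = [TOY_LEXICON[token] for token in tokens if token in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("I love this phone"))    # 0.5: "love" is in the lexicon
print(toy_polarity("I looove this phone"))  # 0.0: "looove" is not, so no signal remains
```

The second tweet expresses the same sentiment as the first, yet the informal spelling leaves the scorer with nothing to average.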

Say we’ve been asked to improve our classifier. We can try various techniques and algorithms. However, we might not see a great improvement in performance because of the noise present in the data. Thus, the key to improving the system lies in better cleaning and pre-processing of the text data. This is crucial for social media text data (SMTD).

**Pre-processing and data cleaning are crucial when working with SMTD. This step is likely to provide the most gains in model performance.**
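
As a starting point, a few common cleaning steps for tweets can be sketched with regular expressions. This is an illustrative example, not the notebook's own pipeline: the function name `clean_tweet` and the particular rules (dropping URLs and @-mentions, keeping hashtag words, shortening elongated characters) are assumptions, and a real system would likely need more.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
ELONGATED_RE = re.compile(r"(.)\1{2,}")  # any character repeated three or more times


def clean_tweet(text):
    """Apply a few simple, lossy normalization rules to a tweet (hypothetical helper)."""
    text = URL_RE.sub(" ", text)            # drop URLs
    text = MENTION_RE.sub(" ", text)        # drop @-mentions
    text = text.replace("#", "")            # keep the hashtag word, drop the symbol
    text = ELONGATED_RE.sub(r"\1\1", text)  # "looooove" -> "loove", "!!!" -> "!!"
    return " ".join(text.split())           # collapse runs of whitespace


print(clean_tweet("@bob I looooove this!!! http://t.co/xyz #happy"))
# I loove this!! happy
```

The cleaned text can be passed to `TextBlob` in place of the raw tweet. Note that "loove" is still not a dictionary word, so further normalization such as spelling correction may be needed before lexicon lookups benefit.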