<center>
    <img src="https://weclouddata.s3.amazonaws.com/images/logos/wcd_logo_new_2.png" width='30%'> 
</center>

----------

<h1 align="center"> Twitter Sentiment Analysis Lab </h1>
<br>
<center align="left"> <font size='4'>  Developed by: </font><font size='4' color='#33AAFBD'>WeCloudData</font></center>
<br>

----------

# Sentiment Analysis

> The use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.


Demo: https://azure.microsoft.com/en-ca/services/cognitive-services/text-analytics/

In [1]:
# Libraries for data preparation & visualisation
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(color_codes=True)
import matplotlib.pyplot as plt
%matplotlib inline

# Library to assign sentiment to texts
from textblob import TextBlob

In [2]:
# Reading in data
df = pd.read_csv('tweets.csv')
df.head()

| | id | created_at | text | location |
|---|---|---|---|---|
| 0 | 1083193473539420160 | 2019-01-10 02:47:03 | @CIBC please explain to me why I want to remai... | Canada |
| 1 | 1083191479215026176 | 2019-01-10 02:39:08 | RT @CIBCLiveLabs: We are pleased to announce, ... | Oshawa, Ontario |
| 2 | 1083184422709575683 | 2019-01-10 02:11:05 | CIBC World Markets Inc. Decreases Holdings in ... | The Netherlands |
| 3 | 1083182915826126848 | 2019-01-10 02:05:06 | Le patron de la Banque @cibc s’attend à un ral... | Montréal |
| 4 | 1083177871881818112 | 2019-01-10 01:45:03 | Your home is a valuable asset. Use your equity... | Lower Mainland, BC |


# Getting Labels

To perform sentiment analysis and build a model, we need to label each tweet with its sentiment. We could manually label the data ourselves by reading each tweet and assigning a positive or negative sentiment, but this would be a long and tedious process. A better solution would be to outsource this process by paying for a service such as Amazon Mechanical Turk.

> Amazon Mechanical Turk (MTurk) is a crowdsourcing marketplace that makes it easier for individuals and businesses to outsource their processes and jobs to a distributed workforce who can perform these tasks virtually. This could include anything from conducting simple data validation and research to more subjective tasks like survey participation, content moderation, and more. MTurk enables companies to harness the collective intelligence, skills, and insights from a global workforce to streamline business processes, augment data collection and analysis, and accelerate machine learning development.

https://www.mturk.com/

For our project, we're going to simulate this process by using `TextBlob` to assign the labels. We'll then train a machine learning model to see if we can get similar performance.

https://textblob.readthedocs.io/en/dev/

In [3]:
# Defining a function to assign sentiments (positive, negative or neutral)

def get_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    if sentiment > 0:
        return 'positive'
    elif sentiment < 0:
        return 'negative'
    else:
        return 'neutral'

In [4]:
df['sentiment'] = df['text'].apply(get_sentiment)

In [5]:
df.head()

| | id | created_at | text | location | sentiment |
|---|---|---|---|---|---|
| 0 | 1083193473539420160 | 2019-01-10 02:47:03 | @CIBC please explain to me why I want to remai... | Canada | neutral |
| 1 | 1083191479215026176 | 2019-01-10 02:39:08 | RT @CIBCLiveLabs: We are pleased to announce, ... | Oshawa, Ontario | positive |
| 2 | 1083184422709575683 | 2019-01-10 02:11:05 | CIBC World Markets Inc. Decreases Holdings in ... | The Netherlands | neutral |
| 3 | 1083182915826126848 | 2019-01-10 02:05:06 | Le patron de la Banque @cibc s’attend à un ral... | Montréal | neutral |
| 4 | 1083177871881818112 | 2019-01-10 01:45:03 | Your home is a valuable asset. Use your equity... | Lower Mainland, BC | positive |


- Each text has a sentiment attached to it
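
The labelling rule in `get_sentiment` maps a polarity score (TextBlob reports it in the range -1 to 1) onto three classes. As a minimal sketch, the rule can be isolated from TextBlob and given a tolerance band around zero. The function name `label_from_polarity` and the `threshold` parameter are hypothetical and not part of the lab; with `threshold=0.0` the function behaves like the rule above.

```python
def label_from_polarity(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    `threshold` is a hypothetical tolerance: scores inside
    [-threshold, threshold] are labelled 'neutral'.
    """
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'


# Compare the strict rule with a small tolerance band
for score in (0.5, -0.3, 0.0, 0.05):
    print(score, label_from_polarity(score), label_from_polarity(score, threshold=0.1))
```

A band like this is one possible way to stop weakly scored tweets from being counted as positive or negative; whether it helps depends on the data.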

# Twitter EDA and Feature Engineering

Now that we have our data, let's explore and analyze it. This is known as **exploratory data analysis** or **EDA**.

> Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

https://en.wikipedia.org/wiki/Exploratory_data_analysis

--------

We can then manipulate the data to create **features** for our machine learning model.

> Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning. Feature engineering is an informal topic, but it is considered essential in applied machine learning.

"Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering."

Andrew Ng, *Machine Learning and AI via Brain simulations*

https://en.wikipedia.org/wiki/Feature_engineering

## Exercise - Data exploration

- Read in the tweets data using `pandas`
- Explore the data

Some ideas of things to look for:
- the dimensions of the data
- get DataFrame info
- get summary statistics
- get the value counts of categorical columns
- count missing values

In [6]:
# Dimensions of data
df.shape

(1951, 5)

- There are 1951 rows & 5 columns

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1951 entries, 0 to 1950
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          1951 non-null   int64 
 1   created_at  1951 non-null   object
 2   text        1951 non-null   object
 3   location    1509 non-null   object
 4   sentiment   1951 non-null   object
dtypes: int64(1), object(4)
memory usage: 76.3+ KB


- Location has several missing values (442)
- 'id' is a unique identifier for each tweet; it carries no information about sentiment and can be dropped from the analysis
- 'created_at' is of object type and can be converted to a datetime type
- 'text', 'location' & 'sentiment' are of object type

In [8]:
df.describe(include='all').T

| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 1951.0 | | | | 1.0815337016468797e+18 | 1119671716919130.4 | 1.0795190073368124e+18 | 1.0805807548919848e+18 | 1.0813379960315452e+18 | 1.0826753209358296e+18 | 1.08319347353942e+18 |
| created_at | 1951.0 | 1924.0 | 2019-01-08 20:15:59 | 4.0 | | | | | | | |
| text | 1951.0 | 1768.0 | RT @NHLBlackhawks: Staying warm never looked s... | 25.0 | | | | | | | |
| location | 1509.0 | 428.0 | Canada | 355.0 | | | | | | | |
| sentiment | 1951.0 | 3.0 | neutral | 846.0 | | | | | | | |


- The most common sentiment is 'neutral' and it occurs 846 times (out of 1951 records)
- The most common location is 'Canada', which occurs 355 times (out of 1951 records). There are 428 unique values under location
- Some texts occur more than once (only 1768 of the 1951 texts are unique), which indicates retweets: the most frequent text starts with 'RT' and appears 25 times

In [9]:
# Check for duplicate records (rows that are identical in every column)
df.duplicated().sum()

0

- There are no duplicate records in the data 
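
The zero above counts rows that match in every column, which is why it can coexist with repeated tweet texts: each retweet has its own `id`. A small sketch on made-up data (the `toy` frame is an assumption for illustration, not the lab's data) shows the difference between full-row duplicates and duplicates on the `text` column only.

```python
import pandas as pd

# Toy data (made up): the same retweet text appears under different ids
toy = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'text': ['RT @user: hello', 'RT @user: hello', 'good morning', 'RT @user: hello'],
})

full_row_dupes = toy.duplicated().sum()            # compares every column
text_dupes = toy.duplicated(subset='text').sum()   # compares only the text column

print(full_row_dupes)  # 0
print(text_dupes)      # 2
```

Whether repeated texts should be dropped before modelling is a judgment call; keeping them means identical tweets can land in both the training and the test set.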

In [10]:
df.isnull().sum()

id              0
created_at      0
text            0
location      442
sentiment       0
dtype: int64

- Except for the location column, no column has missing values

In [12]:
# Value counts of categorical columns
df['sentiment'].value_counts()

neutral     846
positive    800
negative    305
Name: sentiment, dtype: int64

In [13]:
df['location'].value_counts()

Canada                    355
Toronto, Ontario           74
The Caribbean              64
Toronto                    57
Jamaica                    35
                         ... 
Toronto via Elmvale         1
Toronto Ontario Canada      1
646.698.3432                1
Rouen                       1
Kingston, Ontario           1
Name: location, Length: 428, dtype: int64

## Exercise - Create features from datetime

- Check the type of the `created_at` column
- Convert the `created_at` column to a `datetime` type
- Create a new column called `hour` with the hour from `created_at`
- Create a new column called `day_of_week` with the `dayofweek` from `created_at`

In [None]:
df['created_at'] = pd.to_datetime(df['created_at'])

In [None]:
df['hour'] = df['created_at'].dt.hour
df.head()

In [None]:
df['day_of_week'] = df['created_at'].dt.dayofweek # Monday = 0, Sunday = 6
df.head()
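
As a quick check of the datetime features, the sketch below repeats the same steps on two timestamps taken from the outputs above. The `toy` frame and the extra `day_name` column are assumptions added for illustration; `day_name` just makes the integer day code easier to verify.

```python
import pandas as pd

toy = pd.DataFrame({'created_at': ['2019-01-10 02:47:03', '2019-01-08 20:15:59']})
print(toy['created_at'].dtype)  # a string-like dtype before conversion

toy['created_at'] = pd.to_datetime(toy['created_at'])
toy['hour'] = toy['created_at'].dt.hour
toy['day_of_week'] = toy['created_at'].dt.dayofweek   # Monday = 0, Sunday = 6
toy['day_name'] = toy['created_at'].dt.day_name()     # readable check of the integer code
print(toy)
```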

## Exercise - Dealing with text data

Pandas has many methods for working with text data. We can use these to create features from our tweet text.

A full list of these string methods can be found at: https://pandas.pydata.org/pandas-docs/stable/text.html

- Create a new column called `num_chars` that is the number of characters in each tweet
- Create a new column called `num_words` that is a count of how many words in each tweet 
- Create a new column called `num_ats` that is a count of how many `@` symbols are in each tweet

In [None]:
df['text']

In [None]:
df['num_chars'] = df['text'].str.len()
df.head()

In [None]:
df['num_words'] = df['text'].str.count(' ') + 1
# Note 1: this counts space-separated tokens, and not every token is a word
# Note 2: the +1 accounts for the last token, which has no space after it
df.head()

In [None]:
df['num_ats'] = df['text'].str.count('@')
df.head()
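
Counting spaces with `str.count(' ') + 1` overcounts when a tweet contains consecutive spaces. One possible alternative, shown here on made-up strings, is to split on runs of whitespace and count the resulting tokens. This is a sketch of a different technique, not the method used in the lab.

```python
import pandas as pd

toy = pd.Series(['@CIBC  thanks for the help', 'hello world'])

by_spaces = toy.str.count(' ') + 1      # counts separators, so a double space adds a token
by_split = toy.str.split().str.len()    # splits on any run of whitespace

print(by_spaces.tolist())  # [6, 2]
print(by_split.tolist())   # [5, 2]
```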

## Exercise - Positive and negative words count

Sometimes we might want to use external data to help build features. Here we count how many positive and negative words there are in each tweet by comparing them to a predefined list of words.

We borrow our list of pos/neg words from this study: https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html

In [None]:
pos_words = pd.read_csv('twitter-ml-lab/positive-words.txt', skiprows=35, names=['words'])
pos_words = pos_words['words'].values.tolist()
pos_words

In [None]:
neg_words = pd.read_csv('twitter-ml-lab/negative-words.txt', skiprows=35, names=['words'])
neg_words = neg_words['words'].values.tolist()

neg_words

In [None]:
def count_words(tweet, words): # works with either pos_words or neg_words
    count = 0
    for word in tweet.split(' '):
        if word in words:
            count += 1
    return count

In [None]:
df['pos_count'] = df['text'].apply(count_words, words = pos_words) # extra keyword arguments to .apply are passed on to count_words
df.head()

In [None]:
# Equivalent to the .apply call in the previous cell, written as an explicit loop
lists = []
for text in df['text']:
    lists.append(count_words(text, pos_words))
df['pos_count'] = lists

In [None]:
df['neg_count'] = df['text'].apply(count_words, words = neg_words)
df.head()
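
`count_words` compares raw tokens, so `Pleased` or `great!` will not match lexicon entries such as `pleased` or `great`. The sketch below shows one possible normalization: lowercase each token and strip surrounding punctuation before the lookup. The helper name `count_words_normalized` and the tiny `sample_pos` list are hypothetical stand-ins, and stripping punctuation would also prevent matches on any lexicon entry that itself contains punctuation.

```python
import string


def count_words_normalized(tweet, words):
    """Count tokens of `tweet` found in `words` after lowercasing
    and stripping surrounding punctuation (hypothetical helper)."""
    count = 0
    for token in tweet.split():
        cleaned = token.lower().strip(string.punctuation)
        if cleaned in words:
            count += 1
    return count


sample_pos = ['great', 'pleased']   # tiny stand-in for pos_words
pos_lookup = set(sample_pos)        # build the set once; membership tests are fast

print(count_words_normalized('We are Pleased to announce a GREAT update!', pos_lookup))  # 2
```

Converting the word list to a `set` once, outside the function, avoids scanning the full list for every token.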

## Exercise - Analysis and visualization

- Explore the new **features** we've created
- Try and perform your own analysis and see if you can find any interesting insights
- Create some visualizations to further analyze the data

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df['day_of_week'].value_counts().plot(kind='bar');

In [None]:
df['hour'].value_counts().plot(kind='bar');

In [None]:
df['pos_count'].value_counts().plot(kind='bar');
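
Beyond bar charts of single columns, a grouped summary is one simple way to see whether a feature differs between sentiment classes. The sketch below uses a made-up `toy` frame with the same column names as the lab; the numbers are illustrative only.

```python
import pandas as pd

# Made-up values, using the lab's column names
toy = pd.DataFrame({
    'sentiment': ['positive', 'positive', 'negative', 'neutral'],
    'pos_count': [2, 1, 0, 0],
    'neg_count': [0, 0, 2, 0],
})

# Mean of each feature within each sentiment class
summary = toy.groupby('sentiment')[['pos_count', 'neg_count']].mean()
print(summary)
```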

### Pandas-profiling
1. Consider how many features you have: the more features, the slower the cell will run
2. Think about whether you need all of those graphs

## Exercise - Machine Learning

- Get your new features and the label
- Split the data into a training and test set with test_size 0.25 and random_state 2019
- Try training different models and tune their hyperparameters
- Get the model with the highest score

In [None]:
# Imports
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier #model
from sklearn.metrics import classification_report #metric

In [None]:
# Train_test_split
# X = df.drop(columns=['sentiment','id','location','created_at','text'])
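X = df[['hour','day_of_week','num_words','num_chars','num_ats','pos_count','neg_count']] # Double brackets: a list of columns returns a DataFrame
y = df['sentiment'] # Single brackets return a Series

# train_test_split: X and y are required; test_size and random_state are optional.
# test_size defaults to 0.25, and fixing random_state makes the split reproducible.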
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=2019)

In [None]:
# Fit and predict
rfc = RandomForestClassifier()

rfc.fit(X_train, y_train) #Both train sets

y_pred = rfc.predict(X_test) #Only X_test, do NOT include y_test

In [None]:
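# Not recommended: each RandomForestClassifier() call creates a new, separate model,
# so the second line predicts with an unfitted model and raises NotFittedError
RandomForestClassifier().fit(X_train, y_train)
RandomForestClassifier().predict(X_test)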

In [None]:
# Evaluate
print(classification_report(y_test, y_pred))
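
A classification report is easier to interpret next to a trivial baseline. In this data, always predicting 'neutral' would already be right for about 43% of tweets (846 of 1951). The sketch below compares a random forest with scikit-learn's `DummyClassifier` on synthetic data; the data, the variable names, and the use of `stratify` are assumptions for illustration, not part of the lab.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends only on the sign of the first feature
rng = np.random.default_rng(2019)
X_toy = rng.normal(size=(200, 3))
y_toy = np.where(X_toy[:, 0] > 0, 'positive', 'negative')

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2019, stratify=y_toy
)

# Baseline: always predict the most frequent training class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
forest = RandomForestClassifier(random_state=2019).fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
forest_acc = forest.score(X_te, y_te)
print(f'baseline accuracy: {baseline_acc:.2f}')
print(f'forest accuracy:   {forest_acc:.2f}')
```

Setting `random_state` on the classifier makes repeated runs comparable, which helps when tuning hyperparameters.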

## Steps of Machine Learning:

1. Imports
2. Load your data
3. EDA - Exploratory Data Analysis
4. Cleaning/Feature Engineering
5. Visualization
6. X, y, train test split
7. Scaling (needed when feature values differ greatly in magnitude)
 - A scaler works like a model: fit it on the training set, then transform both the training and test sets
8. Model (baseline model)
9. Evaluate
10. Do more (feature selection, hyperparameter tuning, feature engineering, ask for more data, etc.)
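
Step 7 can be sketched with scikit-learn's `StandardScaler`. The synthetic features below are assumptions chosen to mimic an hour-like and a character-count-like column. The scaler learns its mean and standard deviation from the training split only and then reuses them on the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales
rng = np.random.default_rng(2019)
X_toy = np.column_stack([
    rng.integers(0, 24, size=100),      # hour-like feature
    rng.integers(20, 280, size=100),    # character-count-like feature
]).astype(float)

X_tr, X_te = train_test_split(X_toy, test_size=0.25, random_state=2019)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)   # learn mean and std from the training data only
X_te_scaled = scaler.transform(X_te)       # reuse the training statistics on the test data

print(X_tr_scaled.mean(axis=0).round(6))
print(X_tr_scaled.std(axis=0).round(6))
```

Tree-based models such as random forests are generally insensitive to feature scale, so this step matters more for distance-based or gradient-based models.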

In [None]:
from sklearn.datasets import load_iris # load_boston was removed in scikit-learn 1.2; other datasets are available