# Sentiment Analysis of Climate Change Tweets

Before getting to the sentiment analysis of tweets about climate change, let's look at a simple example.\
Here is a classic problem from statistics class. Suppose you have two unmarked bowls of cookies. Bowl1 has 30 Oreos and 10 Chips Ahoy! cookies. The other bowl, Bowl2, contains 20 Oreos and 20 Chips Ahoy! cookies. Many questions follow from this simple setup.

* If I pick from Bowl1, what are the chances that I’ll pick an Oreo (my favorite!)?
* If I pick from Bowl2, what are the chances that the cookie is an Oreo?
* If I pick from a random bowl and I get an Oreo, what are the chances that I picked the cookie out of Bowl1 versus Bowl2?

Let's illustrate one of the most important mathematical formulas of the data science era: Bayes' theorem of conditional probability. We can state the theorem as follows:

$$P\left(A|B\right) = \frac{P\left(B|A\right)\cdot P(A)}{P(B)}$$ 

We would say that the probability of A given B is the probability of B given A times the probability of A, divided by the probability of B.
Luckily, a lot of probability just boils down to counting and dividing. The table below shows how to use a table of counts to calculate everything you need.

| Cookie                | Bowl1   | Bowl2 | Total |
|-----------------------|:-------:|:-----:|------:|
| Oreo                  | 30      | 20    | 50    |
| Chips Ahoy!           | 10      | 20    | 30    |
| **Total**             | 40      | 40    | 80    |

The probability table will be:

| Probability           |         | Value |
|-----------------------|---------|------:|
| P(Bowl1)              | = 40/80 |  0.5  |
| P(Bowl2)              | = 40/80 |  0.5  |
| P(Oreo)               | = 50/80 | 0.625 |
| P(ChipsAhoy!)         | = 30/80 | 0.375 |
| P(Oreo \| Bowl1)      | = 30/40 |  0.75 |
| P(Oreo \| Bowl2)      | = 20/40 |  0.5  |
| P(ChipsAhoy! \| Bowl1)| = 10/40 |  0.25 |
| P(ChipsAhoy! \| Bowl2)| = 20/40 |  0.5  |
| P(Bowl1 \| Oreo)      | = 30/50 |  0.6  |
| P(Bowl2 \| Oreo)      | = 20/50 |  0.4  |
| P(Bowl1 \| ChipsAhoy!)| = 10/30 | 0.333 |
| P(Bowl2 \| ChipsAhoy!)| = 20/30 | 0.667 |
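
As a quick check of the table, the sketch below recomputes P(Bowl1 | Oreo) in two ways: by counting directly, and through Bayes' theorem. The dictionary layout and variable names are illustrative choices, not part of the original notebook.

```python
# Cookie counts copied from the count table: counts[cookie][bowl]
counts = {
    "Oreo": {"Bowl1": 30, "Bowl2": 20},
    "Chips Ahoy!": {"Bowl1": 10, "Bowl2": 20},
}

total = sum(sum(row.values()) for row in counts.values())  # all cookies
bowl_totals = {
    bowl: sum(row[bowl] for row in counts.values()) for bowl in ("Bowl1", "Bowl2")
}

p_bowl1 = bowl_totals["Bowl1"] / total                                # P(Bowl1)
p_oreo = sum(counts["Oreo"].values()) / total                         # P(Oreo)
p_oreo_given_bowl1 = counts["Oreo"]["Bowl1"] / bowl_totals["Bowl1"]   # P(Oreo | Bowl1)

# Direct count: Oreos in Bowl1 divided by all Oreos
p_bowl1_given_oreo_direct = counts["Oreo"]["Bowl1"] / sum(counts["Oreo"].values())

# Bayes' theorem: P(Bowl1 | Oreo) = P(Oreo | Bowl1) * P(Bowl1) / P(Oreo)
p_bowl1_given_oreo_bayes = p_oreo_given_bowl1 * p_bowl1 / p_oreo

print(p_bowl1_given_oreo_direct, p_bowl1_given_oreo_bayes)
```

Both routes should agree, which is a useful sanity check whenever a probability table is built by hand.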

Let's focus for a moment on the problem of deciding whether we chose from Bowl1 or Bowl2. If we pick out an Oreo, there is a 60% chance it came from Bowl1 and a 40% chance that it came from Bowl2. That doesn’t give us a ton of confidence that we have the right bowl. But what if we gather more data? What if we put the cookie back, carefully stir the cookies around and then pick another one? If this one also comes out as an Oreo, can we use that information to improve our guess about which bowl we chose from?

It turns out that we can: the more evidence we get, the better we are able to predict the bowl. When we go down this road we are going to take a bit of a mathematical shortcut, so our answer will no longer be a probability. That's OK, because our end goal is to build a classifier: an algorithm that, given some data, tells us whether something is one thing or another. For example, given Oreo, Oreo, Oreo, Chips Ahoy!, it is most likely that the bowl we were picking from is Bowl1.
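
To make the effect of extra evidence concrete, here is a minimal sketch that updates the belief about the bowl after each Oreo, assuming the cookie is put back after every draw. The `update` helper and the `belief` dictionary are hypothetical names introduced only for this illustration.

```python
# Probability of drawing an Oreo from each bowl (from the probability table)
p_oreo = {"Bowl1": 0.75, "Bowl2": 0.5}

def update(prior: dict, likelihood: dict) -> dict:
    """One Bayes update: multiply the prior by the likelihood, then renormalize."""
    unnormalized = {bowl: prior[bowl] * likelihood[bowl] for bowl in prior}
    evidence = sum(unnormalized.values())
    return {bowl: value / evidence for bowl, value in unnormalized.items()}

belief = {"Bowl1": 0.5, "Bowl2": 0.5}  # before any draw, both bowls are equally likely
for draw in range(1, 4):               # three Oreos in a row
    belief = update(belief, p_oreo)
    print(draw, {bowl: round(p, 3) for bowl, p in belief.items()})
```

Each additional Oreo pushes the belief further towards Bowl1, because Oreos are more common there.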

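The way to think about this is to ask: what is the probability that it is Bowl1 given Oreo, Oreo, Oreo, Chips Ahoy!? Or, to state it mathematically:

$$P\left(C|x_1,x_2,x_3,\ldots,x_n\right)$$

By Bayes' theorem, this is proportional to the expression below (the denominator, the probability of the evidence, is the same for every class, so we can drop it):

$$P\left(x_1,x_2,x_3,\ldots,x_n|C\right)\cdot P\left(C\right)$$

If we assume that the draws are independent of each other given the class (this is the "naive" assumption), we can combine the individual probabilities using multiplication. So for each class Cj the above expression becomes:

$$P\left(C_j\right)\cdot \prod_{i=1}^{n} P\left(x_i|C_j\right)$$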

Now if we compute that formula for each possible Cj, then the one with the highest value is our winner.

Let's work out the scores for our Oreo, Oreo, Oreo, Chips Ahoy! example. The probability that we get an Oreo given that it is Bowl1 is 0.75, and the probability that we get a Chips Ahoy! given that it is Bowl1 is 0.25. The probability that a cookie comes from Bowl1 is 0.5, thus:

$$Bowl1 = P(Bowl1)\cdot P\left(Oreo|Bowl1\right)\cdot P\left(Oreo|Bowl1\right)\cdot P\left(Oreo|Bowl1\right)
\cdot P\left(Ahoy!|Bowl1\right)$$

$$Bowl2 = P(Bowl2)\cdot P\left(Oreo|Bowl2\right)\cdot P\left(Oreo|Bowl2\right)\cdot P\left(Oreo|Bowl2\right)
\cdot P\left(Ahoy!|Bowl2\right)$$ 

In [38]:
Bowl1 = round((0.5 * 0.75 * 0.75 * 0.75 * 0.25), 3)
Bowl2 = round((0.5 * 0.5 * 0.5 * 0.5 * 0.5), 3)

print(f'Bowl1 = {Bowl1}, Bowl2 = {Bowl2}')

Bowl1 = 0.053, Bowl2 = 0.031


Since the Bowl1 score is higher than the Bowl2 score, we conclude that we were most likely picking cookies out of Bowl1.
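
The same calculation can be written as a small reusable function. The sketch below is one possible way to do it; `naive_bayes_scores` is a hypothetical helper, not a library function. It also shows that dividing each score by the sum of the scores turns them back into probabilities, should we ever need them.

```python
from math import prod

def naive_bayes_scores(priors, likelihoods, observations):
    """Return P(C_j) * prod_i P(x_i | C_j) for every class C_j."""
    return {
        label: priors[label] * prod(likelihoods[label][x] for x in observations)
        for label in priors
    }

# Values taken from the first probability table
priors = {"Bowl1": 0.5, "Bowl2": 0.5}
likelihoods = {
    "Bowl1": {"Oreo": 0.75, "Chips Ahoy!": 0.25},
    "Bowl2": {"Oreo": 0.5, "Chips Ahoy!": 0.5},
}
draws = ["Oreo", "Oreo", "Oreo", "Chips Ahoy!"]

scores = naive_bayes_scores(priors, likelihoods, draws)
winner = max(scores, key=scores.get)  # class with the highest score

# Normalizing the scores recovers the posterior probability of each bowl
total = sum(scores.values())
posterior = {label: score / total for label, score in scores.items()}

print(winner, scores, posterior)
```

For classification only the ranking of the scores matters, which is why the normalization step is usually skipped.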

Now suppose we modify the table so that the number of Chips Ahoy! in Bowl1 is 40 and the number of Oreos in Bowl2 is 30. Given the same Oreo, Oreo, Oreo, Chips Ahoy! draws, what are the new scores for Bowl1 and Bowl2?

| Cookie                | Bowl1   | Bowl2 | Total |
|-----------------------|:-------:|:-----:|------:|
| Oreo                  | 30      | 30    | 60    |
| Chips Ahoy!           | 40      | 20    | 60    |
| **Total**             | 70      | 50    | 120   |

The probability table will be:

| Probability           |         | Value |
|-----------------------|---------|------:|
| P(Bowl1)              |= 70/120 | 0.583 |
| P(Bowl2)              |= 50/120 | 0.417 |
| P(Oreo)               |= 60/120 |  0.5  |
| P(ChipsAhoy!)         |= 60/120 |  0.5  |
| P(Oreo \| Bowl1)      | = 30/70 | 0.429 |
| P(Oreo \| Bowl2)      | = 30/50 |  0.6  |
| P(ChipsAhoy! \| Bowl1)| = 40/70 | 0.571 |
| P(ChipsAhoy! \| Bowl2)| = 20/50 |  0.4  |
| P(Bowl1 \| Oreo)      | = 30/60 |  0.5  |
| P(Bowl2 \| Oreo)      | = 30/60 |  0.5  |
| P(Bowl1 \| ChipsAhoy!)| = 40/60 | 0.667 |
| P(Bowl2 \| ChipsAhoy!)| = 20/60 | 0.333 |

In [39]:
Bowl1 = round((0.429 * 0.429 * 0.429 * 0.571 * 0.583), 3)
Bowl2 = round((0.6 * 0.6 * 0.6 * 0.4 * 0.417), 3)

print(f'Bowl1 = {Bowl1}, Bowl2 = {Bowl2}')

Bowl1 = 0.026, Bowl2 = 0.036


Since the Bowl2 score is higher than the Bowl1 score, we conclude that we were most likely picking cookies out of Bowl2.

Now let's add a third kind of cookie to both bowls. Suppose we had a bunch of Fig Newtons, 20 of them in Bowl1 and 30 of them in Bowl2, and we have the following series of draws: Oreo, Fig Newton, Fig Newton, Chips Ahoy!, Oreo. What are the new scores for Bowl1 and Bowl2?

| Cookie                | Bowl1   | Bowl2 | Total |
|-----------------------|:-------:|:-----:|------:|
| Oreo                  | 30      | 30    | 60    |
| Chips Ahoy!           | 40      | 20    | 60    |
| Fig Newtons           | 20      | 30    | 50    |
| **Total**             | 90      | 80    | 170   |

The probability table will be:

| Probability           |         | Value |
|-----------------------|---------|------:|
| P(Bowl1)              |= 90/170 | 0.529 |
| P(Bowl2)              |= 80/170 | 0.471 |
| P(Oreo)               |= 60/170 | 0.353 |
| P(ChipsAhoy!)         |= 60/170 | 0.353 |
| P(Fig Newtons)        |= 50/170 | 0.294 |
| P(Oreo \| Bowl1)      | = 30/90 | 0.333 |
| P(Oreo \| Bowl2)      | = 30/80 | 0.375 |
| P(ChipsAhoy! \| Bowl1)| = 40/90 | 0.444 |
| P(ChipsAhoy! \| Bowl2)| = 20/80 | 0.25  |
| P(Fig Newtons\| Bowl1)| = 20/90 | 0.222 |
| P(Fig Newtons\| Bowl2)| = 30/80 | 0.375 |
| P(Bowl1 \| Oreo)      | = 30/60 |  0.5  |
| P(Bowl2 \| Oreo)      | = 30/60 |  0.5  |
| P(Bowl1 \| ChipsAhoy!)| = 40/60 | 0.667 |
| P(Bowl2 \| ChipsAhoy!)| = 20/60 | 0.333 |
| P(Bowl1\| Fig Newtons)| = 20/50 |  0.4  |
| P(Bowl2\| Fig Newtons)| = 30/50 |  0.6  |


In [40]:
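# One factor for the prior plus one per draw: Oreo, Fig Newton, Fig Newton, Chips Ahoy!, Oreo
Bowl1 = round((0.529 * 0.333 * 0.222 * 0.222 * 0.444 * 0.333), 3)
Bowl2 = round((0.471 * 0.375 * 0.375 * 0.375 * 0.25 * 0.375), 3)

print(f'Bowl1 = {Bowl1}, Bowl2 = {Bowl2}')

Bowl1 = 0.001, Bowl2 = 0.002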


Since the Bowl2 score is higher than the Bowl1 score, we conclude that we were most likely picking cookies out of Bowl2.
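
Typing rounded probabilities by hand is error-prone, so it can help to compute the scores straight from the counts. The following is a minimal sketch under that idea, with one likelihood factor per draw; the dictionary layout and variable names are assumptions made for illustration.

```python
from math import prod

# Counts from the three-cookie table: counts[bowl][cookie]
counts = {
    "Bowl1": {"Oreo": 30, "Chips Ahoy!": 40, "Fig Newton": 20},
    "Bowl2": {"Oreo": 30, "Chips Ahoy!": 20, "Fig Newton": 30},
}
draws = ["Oreo", "Fig Newton", "Fig Newton", "Chips Ahoy!", "Oreo"]

all_cookies = sum(sum(bowl.values()) for bowl in counts.values())

scores = {}
for bowl, cookies in counts.items():
    bowl_total = sum(cookies.values())
    prior = bowl_total / all_cookies                                  # P(bowl)
    likelihood = prod(cookies[cookie] / bowl_total for cookie in draws)  # one factor per draw
    scores[bowl] = prior * likelihood

print({bowl: round(score, 4) for bowl, score in scores.items()})
```

Because no intermediate value is rounded, the scores can differ slightly from a calculation based on three-decimal probabilities, but the ranking of the bowls is what matters.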

## Going from Cookies to Tweets

This all gets much more interesting when we look at a real-world problem. In fact, this kind of Bayesian classification became extremely popular about 20 years ago as the basis of the first email spam filters that worked well. More recently it has become a good technique for doing sentiment analysis.\
Instead of bowls of cookies we have bags of words. One bag has all the words we have collected from millions of emails that users have marked as spam. The other bag contains all the words we have collected from emails that were not spam. We can build a table just like we did for our Oreo and Chips Ahoy! example. Of course, this table will have a lot more rows, as words come in a much greater variety than cookies. Nevertheless, we can count how many times each word occurs in our spam bag and how many times it occurs in the non-spam bag, and compute our probabilities from there.\
To start with, we have a bunch of tweets that have been categorized as being either of the "climate change is real" variety or of the "climate change is fake" variety. We will use those to build our two bags of words. There are also a bunch of tweets that have been categorized as neutral, but we will leave those aside and focus on the two extremes.

In [41]:
import pandas as pd
import re
from nltk.corpus import stopwords

df_tweets = pd.read_csv('climate_tweets.csv')
df_tweets

Unnamed: 0,tnum,tweet,existence
0,3229,RT @TIME: Another blizzard: What happened to g...,N
1,4263,Another crooked scientist in global warming sc...,N
2,168,Ski resorts fight global warming|SALT LAKE CIT...,Y
3,4708,D.C. Snowstorm: How Global Warming Makes Blizz...,Y
4,3766,[@ClimateProgress] Energy and Global Warming N...,Y
...,...,...,...
1560,3,Carbon offsets: How a Vatican forest failed to...,Y
1561,4730,"Enough with the ""Where's your global warming n...",Y
1562,3765,Rich Galen: Is global warming another DC snow ...,N
1563,5596,Counting down to the World People's Conf on #C...,Y


### Step 1 Cleaning the Data
Before working with the tweets, some cleaning is necessary. Some tweets contain links and punctuation that would alter a word during the analysis (for example, "warming," and "warming" would be counted as different words). Finally, we'll lowercase every word, because the word counts are case-sensitive.

1. Remove URLs and punctuation from tweet
2. Convert all to lower case
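
Before applying these steps to the whole dataset, it can help to try them on a single string. The snippet below is a small sketch of the two steps; `clean_tweet` and the sample text are hypothetical and only serve as an illustration.

```python
import re

def clean_tweet(text: str) -> str:
    """Hypothetical helper: drop URLs and punctuation, then lowercase."""
    no_urls = re.sub(r'https?://\S+', '', text)  # step 1a: remove links
    no_punct = re.sub(r'[^\w\s]', '', no_urls)   # step 1b: remove punctuation
    return no_punct.lower()                      # step 2: lowercase

sample = "RT @TIME: Another blizzard! http://example.com/story"
print(clean_tweet(sample).split())
```

Note that the URL has to be removed before the punctuation; otherwise the link would be turned into a long meaningless word instead of disappearing.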

In [42]:
def removePunctuation(a_str:str)->str:
    '''Remove URLs and punctuation from a str and lowercase it'''
    new_str = re.sub(r'https?:\S*','',a_str) # Remove URLs starting with http: or https:
    new_str = re.sub(r'[^\w\s]','',new_str).lower() # Remove punctuation and lowercase the string
    return new_str

In [43]:
df_tweets['tweet'] = df_tweets['tweet'].map(removePunctuation)
# This could also be done in a single statement using map and lambda functions:
# df_tweets['tweet'] = (df_tweets['tweet'].map(lambda x: re.sub(r'https?:\S*','',x))
#                                         .map(lambda x: re.sub(r'[^\w\s]','',x).lower()))
df_tweets

Unnamed: 0,tnum,tweet,existence
0,3229,rt time another blizzard what happened to glob...,N
1,4263,another crooked scientist in global warming sc...,N
2,168,ski resorts fight global warmingsalt lake city...,Y
3,4708,dc snowstorm how global warming makes blizzard...,Y
4,3766,climateprogress energy and global warming news...,Y
...,...,...,...
1560,3,carbon offsets how a vatican forest failed to ...,Y
1561,4730,enough with the wheres your global warming now...,Y
1562,3765,rich galen is global warming another dc snow j...,N
1563,5596,counting down to the world peoples conf on cli...,Y


### Step 2 Building the Model
1. Make a dictionary for climate change existence and a dictionary for climate change denial:
    * For each tweet, split the string into a list of words and add those words to the appropriate counter, based on the existence column. Do not include so-called stop words, that is, words that are common and used in all kinds of tweets, such as a, an, the, etc.
2. Make a dataframe that includes all of the words from both dictionaries. For each word, this dataframe should contain the count from each counter and the total count.
    * It should look like:

<div align="center">

| Word    |  Y_counts  |  N_counts  | Total_count|
|---------|:----------:|:----------:|-----------:|
| Global  |    2271    |    2167    |    4438    |
</div>

If we use:\
`df_tweets['existence'].unique()`\
we'll see that there are just two values, 'Y' and 'N'.
So let's create one series for each category using a boolean mask.

In [44]:
serie_denial = df_tweets.loc[df_tweets['existence'] == 'N', 'tweet']
serie_denial

0       rt time another blizzard what happened to glob...
1       another crooked scientist in global warming sc...
11      rt yidwithlid the ipccs latest climate change ...
18      global warming in the hot seatby keith yost st...
23      someone go tell the climate change crowd to go...
                              ...                        
1540    new_federalists  i have it on good auth tht gl...
1551    rt keder    rt bglscout climate change weather...
1552                  jmac82 so much for global warming p
1558    climate change fraud  the scandal of solar pow...
1562    rich galen is global warming another dc snow j...
Name: tweet, Length: 323, dtype: object

In [45]:
serie_approval = df_tweets.loc[df_tweets['existence'] == 'Y', 'tweet']
serie_approval

2       ski resorts fight global warmingsalt lake city...
3       dc snowstorm how global warming makes blizzard...
4       climateprogress energy and global warming news...
5       markey presses coal ceos on climate change den...
6       glacial melt from global warming could unplug ...
                              ...                        
1559    los angeles jobs  ft work for greenpeace to st...
1560    carbon offsets how a vatican forest failed to ...
1561    enough with the wheres your global warming now...
1563    counting down to the world peoples conf on cli...
1564    bcbg25  its the denial about alternative energ...
Name: tweet, Length: 1242, dtype: object

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.\
We do not want these words to take up valuable processing time. We can remove them easily by storing a list of the words that we consider to be stop words. NLTK (Natural Language Toolkit) in Python ships stop-word lists for more than a dozen languages.
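
The idea can be illustrated without any library. The sketch below uses a tiny hand-written stop-word set (an assumption for this example only; a real list would be much longer) and a hypothetical `drop_stop_words` helper.

```python
# A tiny hand-written stop-word set, for illustration only
toy_stop_words = {"a", "an", "the", "in", "to", "is", "what"}

def drop_stop_words(text: str, stop_words: set) -> str:
    """Keep only the words that are not in the stop-word set."""
    return " ".join(word for word in text.split() if word not in stop_words)

print(drop_stop_words("what happened to the global warming", toy_stop_words))
```

Storing the stop words in a set rather than a list keeps each membership test fast, which matters when every word of every tweet is checked.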

In [46]:
# Requires a one-time nltk.download('stopwords'); a set makes the membership tests below fast
stop_words = set(stopwords.words('english'))
serie_denial = serie_denial.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
serie_approval = serie_approval.apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

In [47]:
# Now we can create the dictionaries
dict_denial = {}
for tweet in serie_denial:
    for word in tweet.split():
        dict_denial[word] = dict_denial.get(word, 0) + 1

dict_approval = {}
for tweet in serie_approval:
    for word in tweet.split():
        dict_approval[word] = dict_approval.get(word, 0) + 1
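
The same counting can be expressed more compactly with `collections.Counter` from the standard library. This is only an alternative sketch: the toy tweets below are made up and stand in for the two series, and `count_words` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical cleaned tweets standing in for the denial and approval series
toy_denial = ["global warming hoax", "snow global warming"]
toy_approval = ["global warming real", "fight climate change"]

def count_words(tweets) -> Counter:
    """Count every whitespace-separated word across an iterable of tweets."""
    counter = Counter()
    for tweet in tweets:
        counter.update(tweet.split())
    return counter

denial_counts = count_words(toy_denial)
approval_counts = count_words(toy_approval)

# A Counter returns 0 for words it has never seen, instead of raising KeyError
print(denial_counts["global"], approval_counts["global"], approval_counts["hoax"])
```

A `Counter` is a `dict` subclass, so it can be passed to `pd.DataFrame` in the same way as a plain dictionary.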

In [48]:
# The final DataFrame merges both dictionaries - the NaN values must be replaced with 0
df_words = pd.DataFrame([dict_approval, dict_denial], index=['Y_counts', 'N_counts'])
df_words = df_words.transpose().reset_index(names='word').fillna(0)
df_words

Unnamed: 0,word,Y_counts,N_counts
0,ski,6.0,0.0
1,resorts,6.0,0.0
2,fight,48.0,0.0
3,global,648.0,268.0
4,warmingsalt,3.0,0.0
...,...,...,...
4309,jmac82,0.0,1.0
4310,spain,0.0,1.0
4311,rich,0.0,1.0
4312,galen,0.0,1.0


### Laplace Smoothing
If we take a look at the previous DataFrame, we'll notice that there are some zero counts. A zero count leads to a zero probability, which wipes out the whole product for that class; this is known as the ***zero probability error***. It can be solved using Laplace smoothing, a technique for smoothing categorical data.
A small-sample correction, or pseudo-count, is incorporated in every probability estimate. Consequently, no probability will be zero. This is a way of regularizing Naive Bayes; when the pseudo-count is zero, no smoothing is applied.
Thus, the probability of a word ***w'***, even one with a count of zero, can be written as:

$$P\left(w'|Ycount\right) = \frac{\text{number of tweets with } w' \text{ and } y \text{ = positive} +\alpha}{N +\alpha \cdot K}$$

***alpha*** represents the smoothing parameter, **K** represents the number of dimensions (features) in the data, and **N** represents the number of tweets with y = positive. If we choose a value of alpha != 0 (not equal to 0), the probability will no longer be zero even if a word is not present in the training dataset for that class.\
As ***alpha*** increases, the likelihood moves towards the uniform distribution (1/K, which is 0.5 when there are only two features). Most of the time, alpha = 1 is used to remove the problem of zero probability.\
Laplace smoothing is sometimes also known as “add-one smoothing”: 1 (one) is added to all the counts, and then the probability is calculated. This is one of the simplest smoothing techniques.
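
The effect of the smoothing parameter is easy to see numerically. The sketch below implements the formula as a hypothetical `smoothed_probability` function and evaluates it on made-up numbers: a word with a count of zero, a class total of 1000, and only four features.

```python
def smoothed_probability(count: float, class_total: float, n_features: int,
                         alpha: float = 1.0) -> float:
    """(count + alpha) / (class_total + alpha * n_features)."""
    return (count + alpha) / (class_total + alpha * n_features)

# Hypothetical numbers: count = 0, N = 1000, K = 4
for alpha in (0.0, 1.0, 100.0, 1e6):
    print(alpha, smoothed_probability(0, 1000, 4, alpha))
```

With alpha = 0 the estimate stays at zero; with alpha = 1 it becomes small but positive; and as alpha grows the estimate approaches 1/K (0.25 for the four features assumed here), regardless of the observed counts.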

In [49]:
# Adding 1 to all the values in the dataframe
df_words[['Y_counts', 'N_counts']] = df_words[['Y_counts', 'N_counts']] + 1
# Getting the total values for every word
df_words['Total_count'] = df_words['Y_counts'] + df_words['N_counts']
df_words

Unnamed: 0,word,Y_counts,N_counts,Total_count
0,ski,7.0,1.0,8.0
1,resorts,7.0,1.0,8.0
2,fight,49.0,1.0,50.0
3,global,649.0,269.0,918.0
4,warmingsalt,4.0,1.0,5.0
...,...,...,...,...
4309,jmac82,1.0,2.0,3.0
4310,spain,1.0,2.0,3.0
4311,rich,1.0,2.0,3.0
4312,galen,1.0,2.0,3.0


If we take a look at the dataframe above, we'll see that it looks like our cookie count tables, so now we can get the probability of any word coming from a denial or an approval tweet.\
Let's define a function that returns these probabilities for a word provided by the user.

In [50]:
def whatChances(a_df:pd.DataFrame, a_str:str)->tuple:
    '''Return a tuple with the probabilities that a given word comes from an approval or a denial tweet:
    approval at position 0 and denial at position 1'''
    row = a_df[a_df['word'] == a_str].iloc[0]  # the single row for this word
    appr_str = float(row['Y_counts'])
    deni_str = float(row['N_counts'])
    approval_words = a_df['Y_counts'].sum()
    denial_words = a_df['N_counts'].sum()
    appr = round((appr_str/approval_words), 5)
    deni = round((deni_str/denial_words), 5)
    return (appr,deni)

In [51]:
whatChances(df_words, 'fake')

(0.00011, 0.00025)

In [52]:
whatChances(df_words, 'world')

(0.00208, 0.0005)

## Classifying new Tweets
Now let's figure out how to classify a tweet using the Naive Bayes algorithm, with the previous dataframe as our trained model.

In [53]:
# Loading the dataset and removing punctuation, stop-words and lowercasing: 
df_climate = pd.read_csv('climate_test.csv')
df_climate['tweet'] = df_climate['tweet'].map(removePunctuation)
df_climate['tweet'] = df_climate['tweet'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
df_climate

Unnamed: 0,tnum,tweet
0,1202,earthday aware consume waste treat place 1 see...
1,1100,government report says global warming may caus...
2,580,cleaner air could speed global warming link
3,2871,despite sceptics climate change must remain pr...
4,101,rt disturbedwater climate change increases hea...
...,...,...
517,585,global warming today à_ blog archive à_ tackle...
518,5379,icecovered volcanoes may answer climate change...
519,3306,seriously libs really reaching ha rt drudge_re...
520,4693,dear tcot rt newsongreen dc snowstorm global w...


In [54]:
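# Classify a tweet by multiplying the per-word probabilities for each class
# (note: the class priors P(Y) and P(N) are not included in these scores)
vocabulary = set(df_words['word'])  # built once for fast membership tests

def existenceChances(a_str:str)->str:
    '''Classify a tweet as 'Y' or 'N' according to the words it contains'''
    yscore = 1.0
    nscore = 1.0
    for word in a_str.split():
        if word in vocabulary:  # words never seen in training are skipped
            appr, deni = whatChances(df_words, word)
            yscore = yscore * appr
            nscore = nscore * deni
    if yscore > nscore:
        return 'Y'
    else:
        return 'N'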

In [55]:
df_climate['existence'] = df_climate['tweet'].map(existenceChances)
df_climate.sample(15)

Unnamed: 0,tnum,tweet,existence
179,1270,pat mooney dangers geoengineering manipulating...,Y
181,5078,sec recognizes climate change material busines...,Y
337,1642,new_federalists good auth tht global warming a...,N
134,864,soopermexican global warming clearly,N
290,4930,would beneficiaries global warming hype find h...,N
21,2184,grapes best earlywarning system effects climat...,Y
46,1511,soaring mercury blame global warmingagartala a...,Y
68,271,seasonal allergies getting worse climate chang...,Y
166,6028,dr_rose cali getting strange weather year call...,Y
471,1101,rt disturbedwater climate change increases hea...,Y
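
Multiplying many small probabilities can underflow for long texts, so a common variant sums logarithms instead and also adds the class prior from the formula. The sketch below shows that variant on made-up numbers; `toy_counts`, `toy_priors`, `log_score` and `classify` are hypothetical and are not the notebook's model.

```python
import math

# Hypothetical smoothed word counts for a three-word vocabulary
toy_counts = {
    "Y": {"global": 6, "fight": 3, "hoax": 1},
    "N": {"global": 4, "fight": 1, "hoax": 5},
}
# Assumed share of each class among the training tweets
toy_priors = {"Y": 0.8, "N": 0.2}

def log_score(words, label) -> float:
    """log P(label) plus the sum of log P(word | label) over known words."""
    class_total = sum(toy_counts[label].values())
    score = math.log(toy_priors[label])
    for word in words:
        if word in toy_counts[label]:  # skip words outside the vocabulary
            score += math.log(toy_counts[label][word] / class_total)
    return score

def classify(tweet: str) -> str:
    """Return the label with the highest log score."""
    words = tweet.split()
    return max(toy_priors, key=lambda label: log_score(words, label))

print(classify("global warming hoax hoax"), classify("fight global warming"))
```

Because the logarithm is monotonic, the class with the highest log score is the same one that would win with the plain product; only the numerical stability changes.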


## Conclusions
* This kind of sentiment analysis is quite 'simple' and, judging by the sample of classified tweets above, gives plausible results.
* We must be careful, because not every word is well represented in the training data. Even with Laplace smoothing, we should check whether the model needs more words or a different amount of smoothing.
* In machine learning terms, this kind of analysis is an example of supervised learning, which is defined by its use of labeled datasets to train algorithms that classify data or predict outcomes accurately. As input data is fed into the model, it adjusts its weights until the model has been fitted appropriately, which occurs as part of the cross-validation process.
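
To put a number on how well a classifier works, predictions can be compared with known labels on tweets that were held out from training. The sketch below shows the accuracy calculation only; the labels are made up for illustration and are not results from this notebook.

```python
def accuracy(predicted, actual) -> float:
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Hypothetical labels, for illustration only
predicted = ["Y", "Y", "N", "Y", "N"]
actual = ["Y", "N", "N", "Y", "Y"]
print(accuracy(predicted, actual))
```

Since most of the training tweets are labeled 'Y', accuracy alone can be misleading, and it is worth comparing it with the score of always answering 'Y'.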