# Dictionary based text analysis in Python
## Sentiment analysis
**Thomas Monk**  
**London School of Economics**  
with thanks to Chris Bail, Duke University - converted to Pandas & Python from R

## Introduction

Word-counting techniques and dictionary-based methods are the simplest forms of quantitative text analysis.

This tutorial will cover both of these topics, as well as sentiment analysis, which is a form of dictionary-based text analysis.

### Goal.

I want you to output a single diagram that tells us the following: *did Donald Trump's tweets become more or less negative across his term in office*? 

Perhaps uninspired and outdated at this point, but the data is rich, easily accessible, and perfect for this kind of analysis. I want you to think about where these techniques could be useful to you.

## The Data - Tweets

In [2]:
import pandas as pd
import numpy as np
df_raw = pd.read_csv('tweets_01-08-2021.csv') #Kindly provided by The Trump Twitter Archive https://www.thetrumparchive.com/
df_raw = df_raw.sort_values(by=['date'])
df_raw

# I'm going to limit this analysis to 2017 and 2018
df_raw = df_raw[(df_raw['date'] > '2017-01-01')&(df_raw['date'] < '2019-01-01')]
df_raw

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged
45878,815422340540547073,"TO ALL AMERICANS-#HappyNewYear &amp, many bles...",f,f,Twitter for iPhone,108920,26891,2017-01-01 05:00:10,f
45877,815432169464197121,RT @DanScavino: On behalf of our next #POTUS &...,t,f,Twitter for iPhone,0,4562,2017-01-01 05:39:13,f
45876,815433217595547648,RT @Reince: Happy New Year + God's blessings t...,t,f,Twitter for iPhone,0,5811,2017-01-01 05:43:23,f
45875,815433444591304704,RT @EricTrump: 2016 was such an incredible yea...,t,f,Twitter for iPhone,0,5601,2017-01-01 05:44:17,f
45874,815449868739211265,RT @DonaldJTrumpJr: Happy new year everyone. #...,t,f,Twitter for iPhone,0,5548,2017-01-01 06:49:33,f
...,...,...,...,...,...,...,...,...,...
45882,1079763419908243456,"I’m in the Oval Office. Democrats, come back f...",f,f,Twitter for iPhone,126997,27021,2018-12-31 15:37:14,f
45881,1079763923845419009,It’s incredible how Democrats can all use thei...,f,f,Twitter for iPhone,125636,26560,2018-12-31 15:39:15,f
45880,1079830267274108930,Heads of countries are calling wanting to know...,f,f,Twitter for iPhone,87357,21317,2018-12-31 20:02:52,f
45879,1079830268708556800,"....Senator Schumer, more than a year longer t...",f,f,Twitter for iPhone,75463,17875,2018-12-31 20:02:52,f
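
The string comparison in the filter works because the dates are written year-month-day, so alphabetical order matches chronological order. A minimal sketch of the same filter on parsed timestamps, which may be more convenient when grouping tweets by month later; the rows and the names `toy` and `toy_2017_18` are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the tweet archive (made-up rows, same 'date' format)
toy = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd'],
    'date': ['2016-12-31 23:59:59', '2017-01-01 05:00:10',
             '2018-12-31 20:02:52', '2019-01-01 00:00:01'],
})

# Parse the strings into real timestamps
toy['date'] = pd.to_datetime(toy['date'])

# Keep tweets from 2017 and 2018 only
mask = (toy['date'] >= '2017-01-01') & (toy['date'] < '2019-01-01')
toy_2017_18 = toy[mask]
print(toy_2017_18['text'].tolist())
```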


**Question 1** We first need to clean the data. What kind of things do we need to get rid of?
- There's a lot of plain text retweets (i.e. in general tweets that contain RT). Remove these.
- URLs are a pain (but we can leave these for now).

In [3]:
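# Q1 Solution
# Option 1: drop every tweet whose text contains 'RT'
# (careful: this also matches capitalised words such as 'SUPPORT')
df2 = df_raw[~(df_raw['text'].str.contains('RT'))]
display(df2)
# Option 2 - looking at the data! The archive already flags retweets in isRetweet
df = df_raw[df_raw['isRetweet']=='f']
df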

Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged
45878,815422340540547073,"TO ALL AMERICANS-#HappyNewYear &amp, many bles...",f,f,Twitter for iPhone,108920,26891,2017-01-01 05:00:10,f
45872,815930688889352192,"Well, the New Year begins. We will, together, ...",f,f,Twitter for Android,105506,23739,2017-01-02 14:40:10,f
45871,815973752785793024,"Chicago murder rate is record setting - 4,331 ...",f,f,Twitter for Android,52993,13992,2017-01-02 17:31:17,f
45870,815989154555297792,"""@CNN just released a book called """"Unpreceden...",f,f,Twitter for Android,11394,3165,2017-01-02 18:32:29,f
45869,815990335318982656,Various media outlets and pundits say that I t...,f,f,Twitter for Android,39567,7264,2017-01-02 18:37:10,f
...,...,...,...,...,...,...,...,...,...
45882,1079763419908243456,"I’m in the Oval Office. Democrats, come back f...",f,f,Twitter for iPhone,126997,27021,2018-12-31 15:37:14,f
45881,1079763923845419009,It’s incredible how Democrats can all use thei...,f,f,Twitter for iPhone,125636,26560,2018-12-31 15:39:15,f
45880,1079830267274108930,Heads of countries are calling wanting to know...,f,f,Twitter for iPhone,87357,21317,2018-12-31 20:02:52,f
45879,1079830268708556800,"....Senator Schumer, more than a year longer t...",f,f,Twitter for iPhone,75463,17875,2018-12-31 20:02:52,f


Unnamed: 0,id,text,isRetweet,isDeleted,device,favorites,retweets,date,isFlagged
45878,815422340540547073,"TO ALL AMERICANS-#HappyNewYear &amp, many bles...",f,f,Twitter for iPhone,108920,26891,2017-01-01 05:00:10,f
45872,815930688889352192,"Well, the New Year begins. We will, together, ...",f,f,Twitter for Android,105506,23739,2017-01-02 14:40:10,f
45871,815973752785793024,"Chicago murder rate is record setting - 4,331 ...",f,f,Twitter for Android,52993,13992,2017-01-02 17:31:17,f
45870,815989154555297792,"""@CNN just released a book called """"Unpreceden...",f,f,Twitter for Android,11394,3165,2017-01-02 18:32:29,f
45869,815990335318982656,Various media outlets and pundits say that I t...,f,f,Twitter for Android,39567,7264,2017-01-02 18:37:10,f
...,...,...,...,...,...,...,...,...,...
45882,1079763419908243456,"I’m in the Oval Office. Democrats, come back f...",f,f,Twitter for iPhone,126997,27021,2018-12-31 15:37:14,f
45881,1079763923845419009,It’s incredible how Democrats can all use thei...,f,f,Twitter for iPhone,125636,26560,2018-12-31 15:39:15,f
45880,1079830267274108930,Heads of countries are calling wanting to know...,f,f,Twitter for iPhone,87357,21317,2018-12-31 20:02:52,f
45879,1079830268708556800,"....Senator Schumer, more than a year longer t...",f,f,Twitter for iPhone,75463,17875,2018-12-31 20:02:52,f
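
A plain substring test for 'RT' can also catch capitalised words that merely contain those two letters. A small sketch of the difference, assuming that manual retweets begin with 'RT @' (as the sample rows above do); the three example tweets are made up:

```python
import pandas as pd

# Made-up example tweets (not taken from the archive)
toy = pd.DataFrame({'text': [
    'RT @DanScavino: Happy New Year!',
    'THANK YOU FOR YOUR SUPPORT!',
    'Well, the New Year begins.',
]})

# Substring match: also flags 'SUPPORT' because it contains 'RT'
loose = toy['text'].str.contains('RT')

# Anchored match: only tweets that begin with 'RT @'
strict = toy['text'].str.match(r'RT @')

print(loose.tolist())
print(strict.tolist())
```

Here the loose filter flags the first two tweets, while the anchored one flags only the genuine retweet.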


## 2. Corpus creation.
We want to create a corpus of words from these 5396 tweets. This can be a DataFrame of words.
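
One possible shape for such a DataFrame of words is one row per word, with a column recording which tweet it came from. The following is only a sketch on two made-up tweets; the column names `tweet` and `word` are my own choice:

```python
import pandas as pd

toy = pd.DataFrame({'text': ['make america great', 'great great jobs']})

# Split each tweet on whitespace, then give every word its own row
words = (toy['text'].str.split()
         .explode()
         .rename('word')
         .reset_index()
         .rename(columns={'index': 'tweet'}))

# Word counts across the whole toy corpus
counts = words['word'].value_counts()
print(words)
print(counts['great'])
```

Keeping the tweet identifier next to each word makes it easy to join the words back to the tweet's date later on.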


## Text Pre-Processing

Before we begin running quantitative analyses of text, we first need to decide precisely which type of text should be included in our analyses.

For example, very common words such as "the" are often not very informative.

That is, we typically do not care if one author uses the word "the" more often than another in most forms of quantitative text analysis, but we might care a lot about how many times a politician uses the word "economy" on Twitter.

The pre-processing we do is definitely not perfect! You'll spot loads of problems - let's just get ourselves to a semi-clean corpus.

### Question 2a
Create a pre-processing function to use with Pandas apply that performs the following tasks.

**Hashtags and @s!**
Remove these before anything else (they are words that start with # or @) - also remove '&amp' if you see it.
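
One way to sketch this step is with regular expressions. The function name `drop_tags` is hypothetical, and the example string is adapted from the first tweet shown earlier:

```python
import re

def drop_tags(tweet):
    """Remove #hashtags, @mentions and the HTML entity '&amp'."""
    tweet = re.sub(r'&amp;?', ' ', tweet)      # '&amp' with or without ';'
    tweet = re.sub(r'[#@]\w+', ' ', tweet)     # '#' or '@' followed by word characters
    return re.sub(r' +', ' ', tweet).strip()   # tidy up the leftover spaces

print(drop_tags('TO ALL AMERICANS-#HappyNewYear &amp, many blessings @CNN'))
```

This removes only the tag itself, so the text glued to it ('AMERICANS-') survives. Dropping every whitespace-separated word that contains '#' or '@' is a simpler alternative, at the cost of losing that attached text.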

**Case**

Force all text to be lowercase. Do we want “Economy” to be counted as a different word than “economy”? Probably not. What about “God” and “god”? That one is much less straightforward. Nevertheless, it has become commonplace to force all text into lower case in quantitative text analysis.

**Stopwords**

Common words such as "the", "and", "but", "for", "is", etc. are often described as "stop words," meaning that they should not be included in a quantitative text analysis.
Remove these words from each tweet. I've included a full list of stopwords for you to use below.
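
One thing to watch is that a stop word followed by a comma ("the,") will not match "the" exactly. A minimal sketch that strips punctuation from the edges of each token before checking it; `stop` is a tiny illustrative subset of the list and `drop_stopwords` is a hypothetical name:

```python
import string

# A tiny illustrative subset; the tutorial provides the full list
stop = {'the', 'and', 'but', 'for', 'is'}

def drop_stopwords(text):
    """Lowercase, then drop tokens that are stop words once edge punctuation is stripped."""
    kept = []
    for token in text.lower().split():
        core = token.strip(string.punctuation)  # 'the,' -> 'the'
        if core and core not in stop:
            kept.append(token)
    return ' '.join(kept)

print(drop_stopwords('The economy is strong, but the, uh, media is not.'))
```

Using a set rather than a list keeps each membership check fast, which starts to matter over thousands of tweets.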

**Removing whitespace**

A single whitespace character, or a run of them, can end up being counted as a “word” within a corpus. The code below removes repeated spaces from inside a string.

**Removing numbers**

In many texts, numbers can carry significant meaning. Consider, for example, a text about the 4th of July. On the other hand, many numbers add little to the meaning of a text, and so it has become commonplace in the field of natural language processing to remove them from an analysis.
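
A small sketch of number removal with a regular expression, using a made-up sentence; `drop_numbers` is a hypothetical name:

```python
import re

def drop_numbers(text):
    """Replace every run of digits with a space, then collapse repeated spaces."""
    no_digits = re.sub(r'\d+', ' ', text)
    return re.sub(r' +', ' ', no_digits).strip()

print(drop_numbers('The 4th of July 2018 drew 3 parades'))
```

Note the stray "th" left over from "4th": this is exactly the kind of meaning that is lost when numbers are removed.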

**Punctuation**

Another common step in pre-processing text is to remove all punctuation marks. This is generally considered important, since to an algorithm the punctuation mark “,” will assume a unique numeric identity just like the term “economy.” It is often therefore advisable to remove punctuation marks in an automated text analysis, but there are also a number of cases where this can be problematic. Consider the phrase, “Let’s eat, Grandpa” vs. “Lets eat Grandpa.”

Python comes with a built-in string library that includes a full list of ASCII punctuation - see below.
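
As a sketch of one way to strip punctuation in a single pass, `str.translate` can map every punctuation character to a space. The names `punc`, `table` and `drop_punctuation` are my own, and the extra curly quotes and dash are added by hand:

```python
import string

# ASCII punctuation plus a few curly quotes and a dash that string.punctuation lacks
punc = string.punctuation + '‘’“”–'

# Map every punctuation character to a space
table = str.maketrans({ch: ' ' for ch in punc})

def drop_punctuation(text):
    """Swap punctuation for spaces, then collapse the whitespace."""
    return ' '.join(text.translate(table).split())

print(drop_punctuation('Let’s eat, Grandpa!'))
```

Replacing with a space rather than deleting avoids gluing two words together, but it does split contractions such as "Let’s" into "Let" and "s".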

**URLs**

There are a lot of links in tweets - these aren't words. We can remove them easily with the re package - see below.

In [8]:
# Remove multiple spaces
import re
print(re.sub(' +', ' ', 'The     quick brown    fox'))

# Punctuation in Python
import string
print(string.punctuation)
# string.punctuation is ASCII-only, so it misses curly quotes such as ’ - add them by hand
punc_list = string.punctuation + '‘’“”–'

# Remove URLs
print(re.sub('https?://[A-Za-z0-9./]+', '', 'http://www.google.com Hello')) # remove URLs

The quick brown fox
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
 Hello


In [6]:
# From the NLTK package
# import nltk
# nltk.download('stopwords')
# from nltk.corpus import stopwords
# stopwords = stopwords.words('english')
# with some additions
stopwords = ["a","about","above","after","again","against","all","am","an","and","any","are","aren't","as","at","be","because","been","before","being","below","between","both","but","by","can't","cannot","could","couldn't","did","didn't","do","does","doesn't","doing","don't","down","during","each","few","for","from","further","had","hadn't","has","hasn't","have","haven't","having","he","he'd","he'll","he's","her","here","here's","hers","herself","him","himself","his","how","how's","i","i'd","i'll","i'm","i've","if","in","into","is","isn't","it","it's","its","itself","let's","me","more","most","mustn't","my","myself","no","nor","not","of","off","on","once","only","or","other","ought","our","ours","ourselves","out","over","own","same","shan't","she","she'd","she'll","she's","should","shouldn't","so","some","such","than","that","that's","the","their","theirs","them","themselves","then","there","there's","these","they","they'd","they'll","they're","they've","this","those","through","to","too","under","until","up","very","was","wasn't","we","we'd","we'll","we're","we've","were","weren't","what","what's","when","when's","where","where's","which","while","who","who's","whom","why","why's","with","won't","would","wouldn't","you","you'd","you'll","you're","you've","your","yours","yourself","yourselves"]

In [10]:
# Solution to Q2
import string
import re
def preprocess(tweet):
    ret = tweet.lower() # lowercase
    ret_check = ret
    ret = ''
    for word in ret_check.split(): # drop hashtags, @mentions, '&amp' and stopwords
        if word.find('#')==-1 and word.find('@')==-1 and word.find('&amp')==-1 and word not in stopwords:
            ret =  ret + word + ' '
    ret = re.sub('https?://[A-Za-z0-9./]+', ' ', ret) # remove URLs
    remove_char_list = string.punctuation + '‘’“”–1234567890'
    for character in remove_char_list: # remove punctuation and numbers
        ret = ret.replace(character, ' ')
    ret = ret.strip() # remove leading and trailing whitespace
    ret = re.sub(' +', ' ', ret) # remove multiple spaces
    return ret

display(df['text'].head(5))
df['text'].head(5).apply(preprocess)

45878    TO ALL AMERICANS-#HappyNewYear &amp, many bles...
45872    Well, the New Year begins. We will, together, ...
45871    Chicago murder rate is record setting - 4,331 ...
45870    "@CNN just released a book called ""Unpreceden...
45869    Various media outlets and pundits say that I t...
Name: text, dtype: object

45878    to all many blessings to you all looking forwa...
45872    well the new year begins we will together make...
45871    chicago murder rate is record setting shooting...
45870    just released a book called unprecedented whic...
45869    various media outlets and pundits say that i t...
Name: text, dtype: object

## Question 3 - Stemming
A final common step in text pre-processing is stemming. Stemming a word means replacing it with its most basic form. For example, the stem of the word “typing” is “type.” Stemming is common practice because we don’t want the words “type” and “typing” to convey different meanings to the algorithms that we will soon use to extract latent themes from unstructured texts.

The code below shows how to do this in Python - run it first! We use the WordNet Lemmatizer from the NLTK package, which strictly speaking performs lemmatization (mapping each word to its dictionary form) rather than stemming. I won't go into detail here, as this can get very complex!

Use the function below within the function you built above.

After this, save your cleaned DataFrame as a CSV called processed_tweets.csv, with the processed tweets in a column called textp. Then move on to **Part B**.

In [4]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger')
  
lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))
print("corpora :", lemmatizer.lemmatize("corpora"))

def get_wordnet_pos(word): # Need to ensure we lemmatize the word correctly - e.g. 'is' is a verb, not a noun
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)
  
  
# get_wordnet_pos picks the "pos" argument for us; here 'better' is tagged as an adverb, so its lemma is 'well'
word = 'better'
print(lemmatizer.lemmatize(word, get_wordnet_pos(word)))




rocks : rock
corpora : corpus
well


In [8]:
# Solution to Q3
import string
import re

def preprocess(tweet):
    ret = tweet.lower() # lowercase
    ret_check = ret
    ret = ''
    for word in ret_check.split(): # drop hashtags, @mentions, '&amp' and stopwords
        if word.find('#')==-1 and word.find('@')==-1 and word.find('&amp')==-1 and word not in stopwords:
            ret =  ret + word + ' '
    ret = re.sub('https?://[A-Za-z0-9./]+', ' ', ret) # remove URLs
    remove_char_list = string.punctuation + '‘’“”–1234567890'
    for character in remove_char_list: # remove punctuation and numbers
        ret = ret.replace(character, ' ')
    ret = ret.strip() # remove leading and trailing whitespace
    ret = re.sub(' +', ' ', ret) # remove multiple spaces
    ret_full = ''
    for word in ret.split():
        lem_word = lemmatizer.lemmatize(word, get_wordnet_pos(word))
        if len(word)>1:
            ret_full = ret_full + lem_word + ' '
    return ret_full

display(df['text'].head(5))
df = df.copy() # df is a filtered slice of df_raw; copying it avoids pandas' SettingWithCopyWarning
# Demonstration on the first five tweets only, so every other row of textp is NaN.
# Drop .head(5) to process the whole corpus before saving.
df['textp'] = df['text'].head(5).apply(preprocess)
df['textp']
#df.to_csv('processed_tweets.csv',index=False)

45878    TO ALL AMERICANS-#HappyNewYear &amp, many bles...
45872    Well, the New Year begins. We will, together, ...
45871    Chicago murder rate is record setting - 4,331 ...
45870    "@CNN just released a book called ""Unpreceden...
45869    Various media outlets and pundits say that I t...
Name: text, dtype: object

45878    many blessing all look forward wonderful prosp...
45872    well new year begin will together make america...
45871    chicago murder rate record set shoot victim mu...
45870    just release book call unprecedented explores ...
45869    various medium outlet pundit say thought go lo...
                               ...                        
45882                                                  NaN
45881                                                  NaN
45880                                                  NaN
45879                                                  NaN
42123                                                  NaN
Name: textp, Length: 5396, dtype: object
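
Once textp has been filled in for every tweet, the save step can be sketched as follows. The one-row DataFrame is made up, and the file is written to a temporary folder here rather than to the working directory:

```python
import os
import tempfile
import pandas as pd

# Made-up single row standing in for the cleaned tweets
toy = pd.DataFrame({'text': ['Well, the New Year begins.'],
                    'textp': ['well new year begin']})

# In the exercise the file would simply be 'processed_tweets.csv'
path = os.path.join(tempfile.mkdtemp(), 'processed_tweets.csv')
toy.to_csv(path, index=False)  # index=False avoids an extra unnamed index column

# Read it back to check that the columns survived the round trip
back = pd.read_csv(path)
print(back.columns.tolist())
```

Leaving out `index=False` is the usual reason a column like `Unnamed: 0` shows up when a CSV is loaded again.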