In [99]:
# Data Science and MISA Zimbabwe

## Natural Language Processing of Twitter Data from the Run-up to the 2018 Elections

### Notebook by [Joseph Noko](mailto:josephnoko@gmail.com)

In [100]:
**It is recommended to [view this notebook in nbviewer](http://nbviewer.ipython.org/github.com/theobstacleistheway/Noko-Thonje-Company/blob/master/Data%20Science%20and%20MISA%20Zimbabwe.ipynb) for the best viewing experience.**


In [124]:
## Introduction
    
The purpose of this Jupyter Notebook presentation is to introduce MISA Zimbabwe to data science. Data science is the study of the
extraction of knowledge from data. It draws on techniques from many fields, including signal processing, mathematics,
probability models, machine learning, computer programming, statistics, data engineering, pattern matching, data visualization,
uncertainty modeling, data warehousing, and high-performance computing.
Data science is not restricted to big data, although the growing scale of data makes big data an important aspect of
the field.

Data scientists solve complicated data problems using mathematics, statistics and computer science, although expert-level skill in
all of these subjects is not required.[1] A data scientist is most likely to be an expert in only one or two of these
disciplines, so cross-disciplinary teams can be a key component of data science.

Good data scientists are able to apply their skills to achieve a broad spectrum of end results. The skill-sets and competencies 
that data scientists employ vary widely.  

We believe that by using data science, MISA Zimbabwe can learn more about how Zimbabweans are using what
freedom of expression they have, and about the issues that Zimbabweans on social media are discussing. To
this end, we will go through a basic, standard workflow to demonstrate ways in which data science can be used. The
subject of this workflow will be the recently held general elections.

Researchers have used Twitter for a huge range of subjects: [measuring how people cope with global crises and their aftermaths](https://www.unglobalpulse.org/projects/twitter-and-perceptions-crisis-related-stress), tracking
[geographic differences in public health](http://www.cs.jhu.edu/~mpaul/files/2011.icwsm.twitter_health.pdf) and [analyzing the behavior of automated “bot” accounts](https://regmedia.co.uk/2016/10/19/data-memo-first-presidential-debate.pdf) during the 2016 presidential 
debates, to name a few. At least 25 billion tweets were collected and analyzed in scholarly research published from 2007 to 2012,
according to [a paper Proferes published with a colleague](https://www.emeraldinsight.com/doi/abs/10.1108/AJIM-09-2013-0083?journalCode=ajim) in 2014. They counted 382 papers in just the first six years that 
Twitter existed. Katrin Weller, an information scientist working at the GESIS Leibniz Institute for the Social Sciences in 
Germany, [published a separate paper](https://www.ssoar.info/ssoar/bitstream/handle/document/47768/ssoar-knoworg-2014-3-weller-What_do_we_get_from.pdf?sequence=1) in 2014 that also tried to count social science research papers dealing with Twitter, and she
came up with a similar number, as illustrated in the graph below. Proferes and Weller agreed that many more Twitter-based papers have been published since.

<img src="mkb-twitterreasearch-1.png" />
    
Twitter is a preferred source of social media data for research because the website is relatively easy and cheap for scientists 
to use. Twitter is basically set up as a self-publishing platform. Unless a user makes his or her account private — and the vast 
majority do not — everything posted is public information, just as if you printed it up on a flyer or yelled it loudly in the
town square. But Twitter posts are a lot more useful for data analysis than a guy wandering around screaming, “’Tis 6 o’clock and
all is #blessed!”
    
Using [Twitter’s own systems or third-party apps](https://gwu-libraries.github.io/sfm-ui/posts/2017-09-14-twitter-data) that scrape data from the site, scientists can get [free samples of tweets](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5639727/) — [drawn
at random](http://socialmedia-class.org/twittertutorial.html) either from the site’s ongoing daily stream of other people’s consciousness or (using keywords and [other search 
parameters](https://developer.twitter.com/en/docs/tweets/rules-and-filtering/overview/premium-operators)) from the archive of public tweets that go back years. With bigger budgets, scientists can pay for larger collections 
of tweets. And, since Twitter is predominantly a text-based medium, it’s easier to analyze and compare those messages with one 
another than it would be to study Instagram, which has a similar level of public availability, Cesare said. Earlier this week, 
[Twitter announced that it would begin to put some limits](https://blog.twitter.com/developer/en_us/topics/tools/2018/new-developer-requirements-to-protect-our-platform.html) on how many of these requests people can make and which apps they can 
use to make them — but, in general, Proferes said it’s still one of the easiest social media sites to use for research.

Some researchers have used Twitter to amass huge data sets. Proferes found nine papers published between 2007 and 2012 that were 
based on collections of more than 1 billion tweets each. Jennifer Van Hook, a professor of sociology and demography at Penn 
State, is part of a research team that has collected 30 terabytes of geotagged tweets — something the university has promoted as 
[“the largest publicly accessible archive of human behavior in existence.”](https://news.psu.edu/story/474782/2017/07/17/research/twitter-data-changing-future-population-research) One use for this data set: improving the quality of 
other Twitter-based social science research by figuring out how well the population of Twitter matches the population of a given 
geographic area. Eventually, Van Hook told me, the team hopes to create tools that allow researchers to statistically account for
differences between physical and online communities.


In [126]:
## License

Please see the [repository README file](https://github.com/theobstacleistheway/Noko-Thonje-Company/blob/master/README.md) for the
licenses and usage terms for the instructional material and code in this notebook. In general, I have licensed this material so 
that it is as widely usable and shareable as possible.


In [127]:
## Get and Inspect the Data
    
For the purposes of this exercise, we will use the data mined by Mr. Kuda Hove. The data set is
very small, so the conclusions we can draw from it are limited, but it can open a window,
however small, onto what data science can help us with. The period
covered is from the 18th to the 25th of July 2018.


In [128]:
import pandas as pd

# Load and concatenate multiple CSV files with Twitter data
# (tweet data obtained from Mr. K. Hove)

columns = ["Timestamp", "Additional Info", "Location", "User", "Screen Name", "Tweet"]
files = [
    'Election-Week-Data (18_19-07-2018+G).csv',
    'Election-Week-Data (21_23-07-2018+G).csv',
    'Election-Week-Data (24_25-07-2018+G).csv',
]

# header=None: the files are read as having no header row, so the column names are supplied here
frames = [
    pd.read_csv(f, sep=',', encoding='latin 1', header=None, na_values=['NA'], names=columns)
    for f in files
]

# Concatenate data
new_df1 = pd.concat(frames)

# Drop the irrelevant column
del new_df1["Additional Info"]

In [129]:
new_df1.head(10)

Unnamed: 0,Timestamp,Location,User,Screen Name,Tweet
0,2018-07-18 23:59:28,,Insola,Zim_Ozil,RT @Taw_1987: It's your time #ED. Zimbabwe nee...
1,2018-07-18 23:57:16,"Mashonaland East, Zimbabwe",Miss_Ndandi,MPrezha,@girlwithacrushh @ConcernedZimCit Aizve mainin...
2,2018-07-18 23:55:57,,Ana,Ana55076611,Mom just told me to never get fat. She doesn't...
3,2018-07-18 23:44:49,"Jhb,South Africa.",C-life4real,cmhembs,RT @DailyNewsZim: Today's #DailyNews front pag...
4,2018-07-18 23:44:39,,I'm Zimbabweð¿ð¼,ImZimbabwe,RT @povozim: Nelson Chamisa closing prayer at ...
5,2018-07-18 23:43:22,"Jhb,South Africa.",C-life4real,cmhembs,RT @ItsMutai: Pres @edmnangagwa of Zimbabwe is...
6,2018-07-18 23:37:38,,Zcash Foundation,ZcashFoundation,RT @OB1Company: Zcon0 Recap - A 3-Day Look at ...
7,2018-07-18 23:37:37,,Ana,Ana55076611,"I've been eating a lot. Too much, I'll get bac..."
8,2018-07-18 23:34:03,,Jones Musara,JonesMusara,@DrNkuSibanda you are a desperate loser. No la...
9,2018-07-18 23:32:23,BULAWAYO,mlungisi mazhale,zhizzy91,RT @seshndlovini: One canât surely request t...
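
The garbled sequences in the output above (for example `canât`) are consistent with UTF-8 bytes having been decoded as Latin-1. A minimal sketch of one way to repair such strings follows. It assumes the original files were UTF-8, which we have not verified, and `fix_mojibake` is a hypothetical helper name, not part of pandas.

```python
def fix_mojibake(text):
    """Try to undo UTF-8 text that was wrongly decoded as Latin-1.

    If the round trip fails, the text was probably not mojibake,
    so it is returned unchanged.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# Simulate the problem: a curly apostrophe stored as UTF-8 but read as Latin-1
garbled = "can\u2019t".encode('utf-8').decode('latin-1')
repaired = fix_mojibake(garbled)
print(repaired == "can\u2019t")
```

If the assumption holds, the same function could be applied to the `Tweet` and `User` columns with `Series.map`, or the files could be re-read with `encoding='utf-8'`.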


In [130]:
The above represents the first 10 tweets of the entire, concatenated data set.

The data given is in a usable format. The column headers, which we supplied when loading the files, are descriptive
enough for us to understand what each column represents.

Each row represents the elements of a single tweet: the timestamp, the location of the user, the name of the user,
the user's screen name and the body of the tweet.
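
A quick way to inspect columns like these is to check the size of the table and how often values are missing. The sketch below is illustrative only: `sample` is a made-up stand-in with the same column names as `new_df1`, and the "starts with `RT @`" test for retweets is our assumption about how retweets appear in the data.

```python
import pandas as pd

# Stand-in rows with the same columns as the combined data set (values are made up)
sample = pd.DataFrame({
    "Timestamp": ["2018-07-18 23:59:28", "2018-07-18 23:57:16", "2018-07-18 23:55:57"],
    "Location": [None, "Mashonaland East, Zimbabwe", None],
    "User": ["user_a", "user_b", "user_c"],
    "Screen Name": ["a", "b", "c"],
    "Tweet": ["RT @x: hello", "@y hi there", "plain tweet"],
})

# Number of rows and columns
n_rows, n_cols = sample.shape

# Share of tweets with no location, and share that look like retweets
missing_location = sample["Location"].isna().mean()
retweet_share = sample["Tweet"].str.startswith("RT @").mean()

print(n_rows, n_cols)
print(round(missing_location, 2), round(retweet_share, 2))
```

Running the same three checks on `new_df1` would show how much of the data carries a usable location before any geographic conclusions are drawn.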


In [131]:
## Data Preprocessing
    
Before we proceed any further, we will have to clean the raw text, which means splitting it into tokens and handling punctuation and
case.

We are particularly interested in the column containing the body of each tweet.



In [132]:
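# Join the tweet bodies into one long string. Note: no separator is used, so the last
# word of one tweet fuses with the first word of the next (see the tokens ending in RT below).
raw = new_df1["Tweet"]
string_raw = ''.join(str(x) for x in raw)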

In [133]:
type(string_raw)

str

In [134]:
# Total number of characters in the data (len of a string counts characters, not words)
print(len(string_raw))

2283195


In [135]:
# First 100 characters in the corpus
print(string_raw[0:100])

RT @Taw_1987: It's your time #ED. Zimbabwe needs peace and prosperity. #ZimElections2018 #ZimPreside


In [136]:
# Tokenization: break the string up into a list of words and punctuation
# (word_tokenize needs the NLTK Punkt tokenizer data to be downloaded first)
import nltk

tokens = nltk.word_tokenize(string_raw)
type(tokens)

list

In [137]:
len(tokens)

418630

In [138]:
tokens[:10]

['RT', '@', 'Taw_1987', ':', 'It', "'s", 'your', 'time', '#', 'ED']
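
As the output above shows, the general-purpose tokenizer splits `@Taw_1987` and `#ED` into separate symbol and word tokens. A minimal sketch of a tweet-aware alternative follows. The regular expression and the name `tokenize_tweet` are our own illustration, not the method used in this notebook; NLTK also ships a `TweetTokenizer` class in `nltk.tokenize` designed for this purpose.

```python
import re

# Order matters: URLs first, then @mentions and #hashtags,
# then ordinary words (with an optional apostrophe part), then single punctuation marks
TWEET_TOKEN = re.compile(r"https?://\S+|[@#]\w+|\w+(?:'\w+)?|[^\w\s]")


def tokenize_tweet(tweet):
    """Split a tweet into tokens, keeping mentions, hashtags and URLs whole."""
    return TWEET_TOKEN.findall(tweet)


example_tokens = tokenize_tweet("RT @Taw_1987: It's your time #ED.")
print(example_tokens)
```

Keeping hashtags and mentions intact makes it possible to count them separately from ordinary words later on.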

In [139]:
#Take the further step of creating an NLTK text from this list
text = nltk.Text(tokens)
type(text)

nltk.text.Text

In [140]:
#We can slice the text to see inside it and perform other actions
text[1020:1060]

['she',
 'was',
 'ZEC',
 'Chairperson',
 'https',
 ':',
 '//t.co/rjRVmaVZ4d',
 'Post',
 'made',
 'by',
 'creaâ\x80¦RT',
 '@',
 'ZESN1',
 ':',
 'Press',
 'statement',
 'on',
 'the',
 'Ballot',
 'Paper',
 '#',
 'ZimVotes2018',
 '#',
 'ZimDecides2018',
 '@',
 'RindaiVava',
 '@',
 'ellendingani',
 '@',
 'OpenParlyZw',
 '@',
 'kubatana',
 '@',
 '263Chat',
 '@',
 'Eleâ\x80¦RT',
 '@',
 'ChiwaraSarah',
 ':',
 '``']
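
Tokens in the slice above that end in `RT` look like the end of one tweet fused with the `RT` that starts the next, which is what joining tweets with an empty string would produce. A minimal sketch with made-up tweets shows the difference a separator makes. It uses plain `str.split` rather than NLTK so that it runs without any downloads.

```python
# Two made-up tweets: the first ends in a hashtag, the second is a retweet
tweets = [
    "Press statement on the ballot paper #ZimVotes2018",
    "RT @example_user: Stadium full",
]

# Joined with no separator, the hashtag and the following "RT" become one word
fused_words = "".join(tweets).split()

# Joined with a space, the boundary between the two tweets is preserved
spaced_words = " ".join(tweets).split()

print(fused_words)
print(spaced_words)
```

Switching the notebook to `' '.join(...)` would remove such fused tokens, but the character and token counts reported here would change slightly.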

In [141]:
text.collocations()

Dumiso Dabengwa; MDC Alliance; July 2018; Nelson Chamisa; solid
backing; Sibangilizwe Nkomo..son; Joshua Nkomo.The; Godisinit https;
Gweru.Mkoba Stadium; Stadium full; Itâs done; Nkomo.The
momentuâ¦RT; oversubscribed SMART; night.The mood; SMART policy;
Alliance rally; White City; Business Community; policy meeting; last
night.The


In [142]:
## Normalizing Text

# Ignore case distinctions by converting every token to lowercase,
# and filter out tokens that are not purely alphabetic
fdist = nltk.FreqDist(word.lower() for word in text if word.isalpha())
fdist.most_common(500)

[('the', 10802),
 ('to', 7299),
 ('in', 5300),
 ('is', 4924),
 ('of', 4608),
 ('https', 4126),
 ('and', 3732),
 ('a', 3238),
 ('for', 2990),
 ('zimbabwe', 2806),
 ('are', 2311),
 ('nelsonchamisa', 2310),
 ('on', 2293),
 ('you', 2164),
 ('amp', 2149),
 ('we', 2058),
 ('at', 1908),
 ('with', 1878),
 ('that', 1794),
 ('ed', 1669),
 ('has', 1633),
 ('president', 1596),
 ('chamisa', 1589),
 ('this', 1474),
 ('i', 1435),
 ('july', 1362),
 ('rally', 1360),
 ('it', 1337),
 ('our', 1322),
 ('by', 1314),
 ('edhasmyvote', 1259),
 ('not', 1226),
 ('he', 1210),
 ('from', 1204),
 ('zanu', 1145),
 ('will', 1133),
 ('mdc', 1093),
 ('pf', 1065),
 ('povozim', 1027),
 ('people', 1025),
 ('vote', 1013),
 ('all', 986),
 ('mnangagwa', 986),
 ('stadium', 962),
 ('be', 954),
 ('they', 935),
 ('gt', 921),
 ('godisinit', 904),
 ('have', 902),
 ('as', 884),
 ('edmnangagwa', 874),
 ('electionszw', 873),
 ('who', 868),
 ('alliance', 847),
 ('just', 803),
 ('zec', 801),
 ('an', 753),
 ('dabengwa', 743),
 ('change',
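
The list above is dominated by function words such as `the` and `to`, and by `https`, `amp` and `gt`, which are likely residue of links and of HTML entities such as `&amp;` and `&gt;`. A minimal sketch of a stricter normalization step follows. The stopword list is a deliberately short, hypothetical one, and the word pattern `[a-z]+` is a simplification of the `isalpha` filter used in the notebook.

```python
import html
import re
from collections import Counter

# Hypothetical, deliberately short stopword list for illustration
STOPWORDS = {"the", "to", "in", "is", "of", "and", "a", "for"}


def content_word_counts(raw_text):
    """Count lowercase words after removing entities, links and stopwords."""
    # Decode HTML entities so that "&amp;" does not surface as the word "amp"
    text = html.unescape(raw_text)
    # Drop links so that "https" and URL fragments are not counted
    text = re.sub(r"https?://\S+", " ", text)
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


sample_text = "Peace &amp; prosperity for Zimbabwe https://t.co/abc The rally is in Gweru, Zimbabwe"
counts = content_word_counts(sample_text)
print(counts.most_common(3))
```

A fuller stopword list, such as the English list distributed with NLTK, could replace `STOPWORDS` for real use.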

In [144]:
## Analysis

# We can use the above to create a table of the top 20 mentions of political actors
mentions = pd.read_csv('mentions.csv')
mentions

Unnamed: 0,Keyword,Frequency
0,nelsonchamisa,2310
1,chamisa,1589
2,edhasmyvote,1259
3,zanu,1145
4,mdc,1093
5,pf,1065
6,mnangagwa,986
7,godisinit,904
8,edmnagagwa,874
9,zec,801


In [145]:
A simple analysis of the above shows that Chamisa/MDC mentions had a frequency of 8281, whereas Mnangagwa/ZANU PF mentions
had a frequency of 6879, over the interval in which the data was collected. This does not indicate where the election results
would land, given the small sample size and the fact that ZANU PF's electorate is mostly rural, but it does show the dominance
of the opposition on social media. In the sample, a trailing rt on a hashtag, e.g., edhasmyvotert, most likely means that the
hashtag ended one tweet and the next tweet began with RT: the tweets were joined without a separator before tokenizing, so the
two words fused.
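
As a worked example of this kind of tally, the sketch below sums only the ten rows displayed in the table above. The assignment of keywords to the two camps is our assumption, and `zec` (the electoral commission) is left out of both. Because only the displayed rows are used, the totals are smaller than the figures quoted in the text, which presumably draw on the full table.

```python
# Frequencies copied from the ten rows displayed above (spelling as in the table)
displayed_mentions = {
    "nelsonchamisa": 2310, "chamisa": 1589, "edhasmyvote": 1259, "zanu": 1145,
    "mdc": 1093, "pf": 1065, "mnangagwa": 986, "godisinit": 904,
    "edmnagagwa": 874, "zec": 801,
}

# Assumed grouping of keywords into the two camps
camps = {
    "Chamisa/MDC": ["nelsonchamisa", "chamisa", "mdc", "godisinit"],
    "Mnangagwa/ZANU PF": ["edhasmyvote", "zanu", "pf", "mnangagwa", "edmnagagwa"],
}

# Sum the frequencies of each camp's keywords
totals = {camp: sum(displayed_mentions[k] for k in keys) for camp, keys in camps.items()}

for camp, total in totals.items():
    print(camp, total)
```

Making the grouping explicit in this way lets a reader see exactly which keywords feed each total and adjust the lists if they disagree with an assignment.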


In [146]:
## Conclusions
    
Our technology identifies the topics driving public sentiment, helping you to understand not just how people feel, but what’s 
making them feel that way. Our team of human verifiers categorise individual social media posts into a number of 
industry-specific categories from financial services to telcos. Our topics-level analysis empowers you to make tailored 
interventions from customer experience to political campaigning.
    
Based on predefined criteria, public posts from across social media are gathered. Using sophisticated machine learning 
algorithms, each post is evaluated to ensure relevance and the discovery of useful metadata such as location.

A sample of the social media data is distributed to our team of trained human contributors. The team verify the relevancy of the
data and then determine the sentiment contained in the post as being positive, negative, or neutral.

Our team categorises the individual mentions into one or more topic categories. The categories are tailored to particular
industry needs from financial services to retail outlets.
    
Algorithms by themselves cannot accurately understand human conversation with its tonal subtleties, sarcasm, slang, and mixed 
sentiment. Noko, Thonje & Company overcomes these challenges by augmenting its algorithms with a proprietary team of trained 
human contributors. Team members review, verify and categorise the sentiment contained in individual social media posts, helping
us to achieve industry-leading sentiment accuracy.
