# Tweet Bot Behavior in California Fires -Data Preparation-
### Takeshi Oda

## 1. About this data analysis
### 1.1. Twitter Bot
A Twitter Bot is a Twitter user account that is programmed to post messages autonomously.   
Improper use of Twitter bots is known to harm public communication on social media, since bots are sometimes used to manipulate public opinion or to obscure the truth with fake news. 

### 1.2. California Fires
The most recent and worst wildfire in California, the 'Camp Fire', broke out on 8th November 2018 and killed around 85 people (as of 02/12/2018). 

### 1.3. My question
My question in this project is:  

**'Are there any differences in behavior between Bots and Humans on Twitter in the context of a natural disaster?'**     

My guess was that, during this disaster, Bots used Twitter to push political or environmental opinions in addition to expressing sadness or hope about events. I want to learn whether the tweet patterns of bots differ from those of normal users, and which type of account is more likely to post political or environmental messages.

### 1.4. Approach
Tweets about the California fires were collected, and a probability of being a Bot was assigned to each tweet.  
To assign the probability, I used the **'Botometer'** API, which is provided by **Indiana University and the Center for Complex Networks and Systems Research (CNetS)**.

The tweets were then divided into two groups, Bot and Non-Bot, in R, and statistical tests were conducted on several metrics.

## 2. Library settings

In [1]:
import tweepy 
import botometer
from time import sleep
from datetime import datetime
from textblob import TextBlob
import matplotlib.pyplot as plt
import csv
import pandas as pd
from collections import Counter
import string

## 3. Data Collection 

### 3.1. Tweepy 
I made a plain text file called twitter_credentials.py, and put it into my home directory.  
It contains credentials for Twitter API and Botometer API.

    # --- Credentials for Twitter API ---
    consumer_key = '...'
    consumer_secret = '...'
    access_token = '...'
    access_token_secret = '...'
    # --- User Key for Botometer API ---
    mashape_key = "..."  
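
Outside IPython the `%run` magic used in the next section is not available. As a rough sketch, the same file could be loaded with the standard library's `runpy.run_path`, which executes the file and returns its top-level names as a dictionary. The helper name `load_credentials` and the placeholder file written in the demonstration are assumptions for illustration and are not part of the original notebook.

```python
import os
import runpy
import tempfile

# Names expected in the credentials file (taken from the listing above)
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token",
                 "access_token_secret", "mashape_key")

def load_credentials(path):
    """Execute a credentials file and return only the expected keys."""
    namespace = runpy.run_path(path)
    missing = [k for k in REQUIRED_KEYS if k not in namespace]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {k: namespace[k] for k in REQUIRED_KEYS}

# Demonstration with a throwaway file holding placeholder values
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "twitter_credentials.py")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("# --- Credentials for Twitter API ---\n")
        for key in REQUIRED_KEYS:
            f.write(f"{key} = 'placeholder'\n")
    creds = load_credentials(demo_path)

print(sorted(creds))
```

Returning a dictionary instead of injecting names into the global namespace makes it easier to see which cell depends on which key.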

### 3.2. Tweet collection for California Fires
Using tweepy, I extracted 60004 tweets with the keyword '#CaliforniaFires'.

In [2]:
%run ~/twitter_credentials.py

## Collect Tweets by Keyword ##
def collect_tweet(keyword, file, notweet=False):
    """
    Collect tweet by keyword
        
    Parameters
    --------------
    keyword
     Search key word such as hashtags, user name
    file
     file path to which returned tweet data is written
    notweet
     Set True if you do not want to receive text message from twitter.
     Only date, retweet count, user name and screen name are returned in this case.
        
    Returns
    --------------
    None
        
    """
    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit = True, wait_on_rate_limit_notify = True) #wait for unlocking rate limit

    user = api.me()
    
    numOfTweets = 60000 #Number of tweets to collect
    #numOfTweets = 100
    msglist = []
    all_msg = []
    
    backoff_counter = 1
    while True:
        msglist = []  #start each attempt empty so a retry does not duplicate rows
        try:
            for tweet in tweepy.Cursor(api.search, q=keyword, lang="en", tweet_mode="extended").items(numOfTweets):
                
                message = TextBlob(tweet.full_text)
                message = message.strip()
                user = tweet.user
                user_name = user.name
                screen_name = user.screen_name
                if notweet:
                    msglist.append( ( tweet.created_at, tweet.retweet_count, user_name, screen_name) )
                else:
                    msglist.append( ( tweet.created_at, tweet.retweet_count, user_name, screen_name, message ) )
                
            #The header must match the row layout chosen by notweet
            header = ["created_at","retweet", "user_name", "screen_name"]
            if not notweet:
                header.append("message")
            with open(file, 'w',newline='',encoding='utf-8') as f:
                writer = csv.writer(f, lineterminator='\n')
                writer.writerow(header)
                writer.writerows(msglist)
            break
                               
        except tweepy.TweepError as e:
            print(e.reason)
            sleep(60*backoff_counter)
            backoff_counter += 1
            continue
        except Exception as e:
            #Catch Exception rather than everything so that Ctrl-C still stops the loop
            print("Unexpected Error:", e)
            sleep(60*backoff_counter)
            backoff_counter += 1
            continue

data_dir = "data"
keyword1 = "#%23CaliforniaFires"  # #CaliforniaFires
file1 = data_dir + "/CaliforniaFires_20181117-20181126.txt"

collect_tweet(keyword1, file1, False)
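
The retry logic in `collect_tweet` waits 60 seconds after the first failure, 120 seconds after the second, and so on. The pattern can be isolated as in the sketch below. The helper name `retry_with_linear_backoff`, the `max_attempts` limit and the injectable `sleep_fn` are assumptions added for illustration; the original function retries without an upper limit.

```python
from time import sleep

def retry_with_linear_backoff(action, base_delay=60, max_attempts=5, sleep_fn=sleep):
    """Call action() until it succeeds, waiting base_delay * attempt between tries."""
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            print(f"attempt {attempt} failed: {exc}")
            sleep_fn(base_delay * attempt)

# Demonstration with a made-up action that fails twice, then succeeds
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "done"

# Record the waits instead of sleeping so the demonstration runs instantly
waits = []
result = retry_with_linear_backoff(flaky, sleep_fn=waits.append)
print(result, waits)
```

Passing the sleep function as a parameter keeps the waiting behaviour testable without actually pausing for minutes.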

The extracted tweets were posted between 17th November and 26th November.
From this data, I selected the tweets from 17th Nov to 20th Nov to reduce the influence of tweet trends specific to the Thanksgiving holiday.

In [3]:
file = data_dir + "/" + "CaliforniaFires_20181117-20181126.txt"
file_out = data_dir + "/" + "CaliforniaFires.txt"

tweets = pd.read_csv(file)   
tweets_sub = tweets[tweets.created_at < '2018-11-21 00:00:00']
tweets_sub.to_csv(file_out, index=False)
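
The filter above compares `created_at` as text, which works because the timestamps are written in `YYYY-MM-DD HH:MM:SS` order. A slightly more defensive variant, sketched here on a made-up three-row frame, parses the column first so that the comparison is made on real timestamps.

```python
import pandas as pd

# Made-up rows for illustration; the real file has one row per collected tweet
demo = pd.DataFrame({
    "created_at": ["2018-11-17 08:15:00", "2018-11-20 23:59:59", "2018-11-22 10:00:00"],
    "screen_name": ["user_a", "user_b", "user_c"],
})

# Parse the text timestamps, then keep everything before 21 November 2018
demo["created_at"] = pd.to_datetime(demo["created_at"])
cutoff = pd.Timestamp("2018-11-21")
demo_sub = demo[demo["created_at"] < cutoff]

print(demo_sub["screen_name"].tolist())
```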

## 4. Adding "Bot Probability" 
After retrieving the tweets for the hashtag #CaliforniaFires, I assigned each tweet a probability of being posted by a Bot using 'Botometer'. 
Since calling this API takes a long time, random samples were taken from CaliforniaFires.txt.

In this research, I took two random samples.

1) 2000 tweets were randomly taken from CaliforniaFires.txt, and bot probabilities were assigned to 1912 accounts via Botometer

2) 4000 tweets were randomly taken from CaliforniaFires.txt, and bot probabilities were assigned to 3686 accounts via Botometer
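
The number of accounts is smaller than the number of sampled tweets, presumably in part because one account can contribute several tweets to a sample and in part because some Botometer lookups fail. The sketch below illustrates the first effect on made-up data. Passing `random_state`, which was not set in the original run, is a suggestion to make the sample repeatable.

```python
import pandas as pd

# Made-up tweets: 6 tweets written by 3 accounts
demo = pd.DataFrame({
    "screen_name": ["a", "a", "b", "b", "c", "c"],
    "message": ["m1", "m2", "m3", "m4", "m5", "m6"],
})

# A fixed seed makes the sample reproducible between runs
sample_1 = demo.sample(n=4, random_state=0)
sample_2 = demo.sample(n=4, random_state=0)

# Botometer is queried once per account, so only unique screen names are kept
users = list(sample_1["screen_name"].unique())
print(len(sample_1), len(users))
```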

In [5]:
def evaluate_bot(user_list):
    """
    Evaluate probability of user account's being Bot
        
    Parameters
    --------------
    user_list
     A list of screen name
        
    Returns
    --------------
     A list of screen name, probability of being Bot 
        
    """    
    bot_prob_list = []
    
    twitter_app_auth = {
        'consumer_key': consumer_key,
        'consumer_secret': consumer_secret,
        'access_token': access_token,
        'access_token_secret': access_token_secret,
      }
    
    #Call Botometer
    bom = botometer.Botometer(wait_on_ratelimit=True,
                              mashape_key=mashape_key,
                              **twitter_app_auth)
    try:
        i = 1
        for screen_name, result in bom.check_accounts_in(user_list):
            try:
                cap_en = result["cap"]["english"]
                cap_unv = result["cap"]["universal"]
                bot_prob = { "screen_name": screen_name, "cap_en": cap_en, "cap_unv": cap_unv }
                bot_prob_list.append(bot_prob)
                
                i = i + 1
            except KeyError as ke:
                print(ke)
                print("KeyError-screen name:" + screen_name)
            except ConnectionError as ce:
                print(ce)
                print("ConnectionError-screen name:" + screen_name)
            except Exception as e:
                print(e)
                print("Exception-screen name:" + screen_name)
    except Exception as e:
        #check_accounts_in itself failed; screen_name and result may be undefined here
        print("Unexpected(check_accounts):", e)
    
    return bot_prob_list


#Read tweet list
data_dir = "data"
file = data_dir + "/" + "CaliforniaFires.txt"
file_out = data_dir + "/" + "CaliforniaFires_BotProb.csv"

tweets = pd.read_csv(file)    
tweets = tweets.drop("created_at", axis=1)
tweets = tweets.drop_duplicates()

#Random sampling from CaliforniaFires.txt
tweets_sample = tweets.sample(n=4000)

#Extract user list from the sample
user_list = list(tweets_sample["screen_name"].unique())

user_list_tmp = list(user_list)
prob_list = []
while user_list_tmp:
    #Call evaluate_bot again on the remaining users in case it stopped early
    batch = evaluate_bot(user_list_tmp)
    #Accumulate results so earlier batches are not overwritten
    prob_list.extend(batch)
    if not batch: #nothing could be evaluated; stop instead of looping forever
        break
    last_name = batch[-1]["screen_name"]
    last_index = user_list_tmp.index(last_name)
    #continue from the user after the last one that was evaluated
    user_list_tmp = user_list_tmp[last_index+1:]

#Write bot probability
with open(file_out, 'w',newline='',encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=["screen_name", "cap_en", "cap_unv"])
    writer.writeheader()
    for prob in prob_list:
        writer.writerow(prob)

'cap'
KeyError-screen name:SistaM_satx
'cap'
KeyError-screen name:RostislavVasko
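
The two `KeyError` messages above come from results that have no `cap` entry, presumably because Botometer returned an error for those accounts (an assumption; the notebook does not show the raw results). A minimal sketch of a more tolerant extraction step follows. The helper name `extract_cap` and the shape of the failing result are illustrative; only the `result["cap"]["english"]` and `result["cap"]["universal"]` paths are taken from the code above.

```python
def extract_cap(screen_name, result):
    """Return a row with CAP scores, or None when the result has no 'cap' entry."""
    cap = result.get("cap")
    if not isinstance(cap, dict):
        return None
    return {
        "screen_name": screen_name,
        "cap_en": cap.get("english"),
        "cap_unv": cap.get("universal"),
    }

# Made-up results: one scored account and one failed lookup
results = [
    ("scored_user", {"cap": {"english": 0.12, "universal": 0.08}}),
    ("failed_user", {"error": "lookup failed"}),
]

rows, skipped = [], []
for name, result in results:
    row = extract_cap(name, result)
    if row is None:
        skipped.append(name)  # keep track of accounts without a score
    else:
        rows.append(row)

print(rows)
print(skipped)
```

Collecting the skipped names in a list makes it possible to report how many sampled accounts ended up without a score.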


## 5. Variable creation and file generation for R

### 5.1. Add bot probability to tweets
I joined the tweet file 'CaliforniaFires.txt' with the Bot probability file 'CaliforniaFires_BotProb.csv' by screen name.  
As a result, a Bot probability is attached to each tweet.
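
A small sketch of this join on made-up data follows. `pd.merge` performs an inner join by default, so tweets from accounts without a Botometer score (for example, accounts outside the random sample) drop out of the result.

```python
import pandas as pd

tweets = pd.DataFrame({
    "screen_name": ["a", "a", "b", "c"],
    "message": ["m1", "m2", "m3", "m4"],
})
# Account "c" has no score, e.g. because it was not sampled
probs = pd.DataFrame({
    "screen_name": ["a", "b"],
    "cap_en": [0.05, 0.90],
})

# Inner join on screen_name: every remaining tweet inherits its author's score
joined = pd.merge(tweets, probs, on="screen_name")
print(joined)
```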

### 5.2. Remove duplicated tweets
I removed duplicated tweets, i.e. tweets with the same screen name and the same message, keeping the copy with the highest retweet count.
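
One way to express this step, sketched on made-up rows: sort so that the copy with the highest retweet count comes first, then keep the first row per (`screen_name`, `message`) pair. This is an alternative formulation with the same effect as the `cumcount`-based filter in the code of section 5.5, not the author's code.

```python
import pandas as pd

demo = pd.DataFrame({
    "screen_name": ["a", "a", "a", "b"],
    "message": ["same text", "same text", "other text", "same text"],
    "retweet": [3, 10, 0, 1],
})

# Highest retweet count first within each (screen_name, message) pair,
# then keep only the first row of each pair
deduped = (
    demo.sort_values(["screen_name", "message", "retweet"],
                     ascending=[True, True, False])
        .drop_duplicates(subset=["screen_name", "message"], keep="first")
        .reset_index(drop=True)
)
print(deduped)
```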

### 5.3. Add secondary data source to tweets
Two files were created in advance of this data processing.

1) policalwords.txt

This file contains 100 words related to political scandal.  
The word list was created from  
https://www.vocabulary.com/lists/183710  
It is used to define the variable 'num_political_word'.

2) environmentalwords.txt

This file contains 123 words related to the environment.  
The word list was created from  
https://www.englisch-hilfen.de/en/words/environment.htm  
It is used to define the variable 'num_environmental_word'.
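
As a sketch of how such a word list can be turned into a count, the snippet below counts how many list entries occur in a message. The three-word list and the helper name `count_listed_words` are made up for illustration; the real lists are read from the two files above. Unlike `count_num_specific_words` in section 5.5, this version strips punctuation before matching, so `drought,` still matches `drought`.

```python
import string

def count_listed_words(message, word_list):
    """Count how many entries of word_list occur as whole words in message."""
    # Lower-case and drop ASCII punctuation so "Climate!" matches "climate"
    cleaned = message.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = set(cleaned.split())
    return sum(1 for word in word_list if word in tokens)

# Made-up stand-in for environmentalwords.txt
env_words = ["climate", "drought", "pollution"]

msg = "Climate change and drought, again. Climate!"
print(count_listed_words(msg, env_words))
```

Each list entry is counted at most once per message, which matches the behaviour of the original function.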

### 5.4. Add variables

The following variables were defined.


|variable   | description |
|:-------|----------|
|retweet | Number of times the tweet was retweeted by other users |
|user_name| Twitter user name  |
|screen_name| Twitter screen name  |
|message  | Tweeted message  |
|bot_probability  | Probability of being a Bot. This value is taken from the CAP (Complete Automation Probability) returned by Botometer. |
|num_words  | Number of words in the message  |
|num_question  | Number of question marks '?' in the message  |
|num_exclamation  | Number of exclamation marks '!' in the message  |
|num_digit_screen_name  | Number of digits (0-9) in the user's screen name  |
|num_political_word  | Number of political words in the message|
|num_environmental_word  | Number of environmental words in the message|
|include_retweet  | 1 if the message is a retweet of another tweet (detected by the presence of 'RT'), otherwise 0|
|num_hashtag  | Number of hashtags in the message|
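
To make the definitions concrete, here is a sketch that computes a few of these variables for a single made-up tweet using plain string methods. The helper name `tweet_features` and the example tweet are hypothetical.

```python
def tweet_features(message, screen_name):
    """Compute a few of the per-tweet variables from the table above."""
    return {
        "num_question": message.count("?"),
        "num_exclamation": message.count("!"),
        "num_hashtag": message.count("#"),
        "num_digit_screen_name": sum(ch.isdigit() for ch in screen_name),
    }

features = tweet_features(
    "Is anyone safe?! Stay strong! #CaliforniaFires #CampFire",
    "fire_watch2018",
)
print(features)
```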

### 5.5. Generate dataset for R
Data set was exported as 'CaliforniaFires_Tweet_Stats.csv' for statistical analysis in R.

In [8]:
#Remove punctuation from sentences
def depunctify(sentences):
    sentences = sentences.replace("--", " ")
    punct = string.punctuation + "‘’”“"
    for p in punct:
        sentences = sentences.replace(p, "")
    return sentences    

#Count number of words in sentences
def count_num_words(row):
    sentence = depunctify(row.message)
    sentence = sentence.replace("\n", " ")
    return len(sentence.split())

#Count number of question mark
def count_num_question(row):
    return row.message.count("?")

#Count number of exclamation mark
def count_num_exclamation(row):
    return row.message.count("!")

#Count number of digits in screen name
def count_num_digit(row):
    return sum(c.isdigit() for c in row.screen_name) 

#Count number of hashtag in tweet
def count_num_hashtag(row):
    return row.message.count("#")

#Count how many words from the list (first column of df_word) appear in the message
def count_num_specific_words(row, df_word):
    cnt = 0
    msg = row.message
    msg = msg.lower()
    word_list = msg.split()
    
    for i in range(len(df_word)):
        cnt += df_word.iloc[i,0] in word_list
    return cnt

#Return 1 if the message contains the given word (case-insensitive substring match), else 0
def include_word(row, word):
    msg = row.message
    msg = msg.lower()
    word = word.lower()
    idx = msg.find(word)
    if idx > -1:
        return 1
    else:
        return 0

data_dir = "data"
tweet_file = data_dir + "/CaliforniaFires.txt"
botprob_file = data_dir + "/CaliforniaFires_BotProb.csv"
political_file = data_dir + "/policalwords.txt"
environmental_file = data_dir + "/environmentalwords.txt"
tweet_stats_file = data_dir + "/CaliforniaFires_Tweet_Stats.csv"

tweet = pd.read_csv(tweet_file)
prob = pd.read_csv(botprob_file) 
political_words = pd.read_table(political_file, names=["word"])
env_words = pd.read_table(environmental_file, names=["word"])

#Join two dataframes by screen name
tweet_df = pd.merge(tweet, prob, on="screen_name")

tweet_df = tweet_df.drop("cap_unv", axis=1)
tweet_df = tweet_df.drop("created_at", axis=1)

tweet_df = tweet_df.rename(columns = {"cap_en": "bot_probability"})
tweet_df = tweet_df.drop_duplicates()
tweet_df = tweet_df.sort_values(["screen_name","message","retweet"], ascending=[True, True, False])
tweet_df.loc[:, "row_num"] = tweet_df.groupby(["screen_name", "message"]).cumcount()
tweet_df = tweet_df[tweet_df.row_num == 0]

tweet_df.loc[:, "num_words"] = tweet_df.apply(count_num_words, axis=1)
tweet_df.loc[:, "num_question"] = tweet_df.apply(count_num_question, axis=1)
tweet_df.loc[:, "num_exclamation"] = tweet_df.apply(count_num_exclamation, axis=1)
tweet_df.loc[:, "num_digit_screen_name"] = tweet_df.apply(count_num_digit, axis=1)
tweet_df.loc[:, "num_political_word"] = tweet_df.apply(count_num_specific_words, df_word=political_words, axis=1)
tweet_df.loc[:, "num_environmental_word"] = tweet_df.apply(count_num_specific_words, df_word=env_words, axis=1)
tweet_df.loc[:, "include_retweet"] = tweet_df.apply(include_word, word="RT", axis=1)
tweet_df.loc[:, "num_hashtag"] = tweet_df.apply(count_num_hashtag, axis=1)


tweet_df.to_csv(tweet_stats_file, index=False)
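
One caveat about `include_retweet`: `include_word` lower-cases both strings and does a substring search, so with `word="RT"` it also flags messages that merely contain the letters `rt` (for example in `support`). If a stricter flag is wanted, a sketch along these lines could be used. It assumes that retweets returned by the search API start with `RT @user`, which is the usual convention but is not verified in this notebook; `is_retweet` is a hypothetical helper.

```python
import re

# Assumption: retweets conventionally start with "RT @screen_name"
RETWEET_PATTERN = re.compile(r"^RT @\w+")

def is_retweet(message):
    """Return 1 if the message looks like a retweet, else 0."""
    return int(bool(RETWEET_PATTERN.match(message)))

print(is_retweet("RT @someone: Thoughts with everyone affected #CaliforniaFires"))
print(is_retweet("Please support the firefighters #CaliforniaFires"))
```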


## 6. Reference
https://www.usnews.com/news/top-news/articles/2018-11-25/number-of-missing-in-deadly-california-wildfire-revised-down-more-rain-on-the-way  

https://botometer.iuni.iu.edu/#!/
