# Twitter Bot Detection

## Integrating machine learning to detect bots

Over more than ten years, Twitter has grown explosively into a major communication hub. "Its primary purpose is to connect people and allow people to share their thoughts with a big audience. Twitter can also be a very helpful platform for growing a following and providing your audience with valuable content before they even become customers" [Hubspot](https://blog.hubspot.com/marketing/what-is-twitter). However, not all accounts belong to genuine users. According to a Twitter SEC filing in 2017, Twitter estimated that 8.5% of all users were bots. Identifying spam bots helps validate the credibility of what is exchanged on the platform and improves the user experience on Twitter.

In this project, we will use the [Cresci-2017](https://botometer.iuni.iu.edu/bot-repository/datasets.html) bot repository datasets to detect bot accounts. We'll begin by exploring how the traits of genuine and bot accounts differ. Then, we will employ supervised learning models (Logistic Regression, Random Forest, Stochastic Gradient Boosting) to build a Twitter bot classifier. Finally, we'll use clustering to identify traits among genuine and spam bot accounts.

 
### Overview of Data
There are a total of 5 files: 
 * 1 example submission file
 * 2 transaction files (test and train)
 * 2 identity files (test and train) 
 
 We will be merging train transaction and train identity to gain more information regarding detecting fraud. To keep things simple, we will only be using the training sets. Below is a [description](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203) of the attributes in each table. 
 
__Transaction Table__

- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAMT: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
- addr: address
- dist: distance
- P_ and (R__) emaildomain: purchaser and recipient email domain
- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- D1-D15: timedelta, such as days between previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

Categorical Features:
- ProductCD
- card1 - card6
- addr1, addr2
- Pemaildomain Remaildomain
- M1 - M9

__Identity Table__

Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
They're collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)

Categorical Features:
- DeviceType
- DeviceInfo
- id12 - id38


In [1]:
# Numpy and pandas
import pandas as pd
import numpy as np

#Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Statistics tools
import scipy.stats as stats

# Sklearn data clean
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Model selection
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Logistic Regression
from sklearn.linear_model import Lasso, LogisticRegression

# KNN Classifier
from sklearn.neighbors import KNeighborsClassifier

# Decision Trees
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from IPython.display import Image
import pydotplus
import graphviz

# Random Forests 
from sklearn.ensemble import RandomForestClassifier

# SVM
from sklearn.svm import SVC

# Gradient Boost
from xgboost import XGBClassifier

# Evaluate
from sklearn import metrics
from sklearn.metrics import log_loss,accuracy_score, f1_score,roc_auc_score, confusion_matrix, classification_report

# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Datetime
from datetime import datetime

# Warnings
import warnings

In [2]:
# Import genuine accounts 
g_users = pd.read_csv('/Users/tsawaengsri/Desktop/Data Science Courses/Datasets/cresci-2017.csv/datasets_full.csv/genuine_accounts.csv/users.csv')

# Import spam bot accounts
soc_bot1_users = pd.read_csv('/Users/tsawaengsri/Desktop/Data Science Courses/Datasets/cresci-2017.csv/datasets_full.csv/social_spambots_1.csv/users.csv')
soc_bot2_users = pd.read_csv('/Users/tsawaengsri/Desktop/Data Science Courses/Datasets/cresci-2017.csv/datasets_full.csv/social_spambots_2.csv/users.csv')
soc_bot3_users = pd.read_csv('/Users/tsawaengsri/Desktop/Data Science Courses/Datasets/cresci-2017.csv/datasets_full.csv/social_spambots_3.csv/users.csv')


In [3]:
print('Genuine tweets')
print(g_users.shape)

print('---------------')

print('Spam bot tweets')
print(soc_bot1_users.shape)
print(soc_bot2_users.shape)
print(soc_bot3_users.shape)

Genuine tweets
(3474, 42)
---------------
Spam bot tweets
(991, 41)
(3457, 40)
(464, 41)


In [4]:
soc_bot2_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3457 entries, 0 to 3456
Data columns (total 40 columns):
id                                    3457 non-null int64
name                                  3457 non-null object
screen_name                           3457 non-null object
statuses_count                        3457 non-null int64
followers_count                       3457 non-null int64
friends_count                         3457 non-null int64
favourites_count                      3457 non-null int64
listed_count                          3457 non-null int64
url                                   10 non-null object
lang                                  3457 non-null object
time_zone                             14 non-null object
location                              13 non-null object
default_profile                       15 non-null float64
default_profile_image                 46 non-null float64
geo_enabled                           5 non-null float64
profile_image_url       

In [None]:
# Select the columns shared by all four user tables
common_cols = ["id", "name", "screen_name", "statuses_count", "followers_count", "friends_count", "lang", "default_profile", "protected", "verified", "description", "contributors_enabled"]
g_users = g_users.loc[:, common_cols]
soc_bot1_users = soc_bot1_users.loc[:, common_cols]
soc_bot2_users = soc_bot2_users.loc[:, common_cols]
soc_bot3_users = soc_bot3_users.loc[:, common_cols]


## Merging datasets
The spam bot accounts are spread across three files, so we first concatenate them into a single bot table. We then label every row with its class (1 for bot, 0 for genuine), stack the bot and genuine tables into one dataframe, and shuffle the rows.

As the shapes printed above show, the four tables do not have the same number of columns, which is why we kept only the columns common to all of them before combining.

In [None]:
b_users = pd.concat([soc_bot1_users,soc_bot2_users,soc_bot3_users], ignore_index=True, sort=False)

In [None]:
# Create tweet class, 1 for bot and 0 for genuine tweets
b_users['class'] = 1
g_users['class'] = 0

In [None]:
# Concatenate df
df = pd.concat([b_users,g_users], ignore_index=True, sort=False)

# Randomly shuffle df 
df = df.reindex(np.random.permutation(df.index))

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info()

## Missing Values
Now, we will take a look at missing values in each column.

In [None]:
# Function to calculate missing values by column
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

In [None]:
missing_values_table(df)

In [None]:
msno.matrix(df)

As shown in the matrix, many columns such as truncated, geo, contributors, etc. appear to be missing all datapoints. The data doesn't appear to be missing at random since there is a repetitive pattern between those columns. 

We will drop columns missing more than 90% of data points and in_reply_to_screen_name since that isn't our focus. Then, we will drop the remaining rows with missing values since that is only a small percentage of the dataset.

In [None]:
# Get columns with >= 90% missing
missing_df = missing_values_table(df)
missing_columns = list(missing_df[missing_df['% of Total Values'] >= 90].index)
print('We will drop %d columns.' % len(missing_columns))
print('Drop columns: ', missing_columns)

In [None]:
df.drop(labels=['truncated', 'geo', 'contributors', 'favorited', 'retweeted', 'possibly_sensitive', 'place','in_reply_to_screen_name'], axis=1, inplace=True)
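
The drop rule described above can also be driven by the threshold itself instead of a hand-written column list. The following is a minimal sketch on a made-up toy frame; the column values and the names `toy`, `sparse_cols` and `cleaned` are assumptions for illustration, not part of the Cresci-2017 data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (all values are made up)
toy = pd.DataFrame({
    "text": ["a", "b", None, "d", "e", "f", "g", "h", "i", "j"],
    "geo": [np.nan] * 10,                  # 100% missing
    "place": [np.nan] * 9 + ["Rome"],      # 90% missing
    "retweet_count": list(range(10)),      # complete
})

# Percentage of missing values per column (same formula as missing_values_table)
missing_pct = 100 * toy.isnull().sum() / len(toy)

# Columns at or above the 90% threshold
threshold = 90
sparse_cols = missing_pct[missing_pct >= threshold].index.tolist()

# Drop the sparse columns, then drop the rows that still contain NaN.
# Both calls return new frames, so the result has to be assigned.
cleaned = toy.drop(columns=sparse_cols).dropna()

print(sparse_cols)
print(cleaned.shape)
```

Deriving the list from the threshold keeps the dropped columns in sync with the data if the input files change.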

In [None]:
# Drop all rows that have any NaN values (dropna returns a new frame, so reassign it)
df = df.dropna()

In [None]:
missing_values_table(df)

## Creating Time Series 

Time series data is used when we want to analyze or explore variation over time. This is useful when exploring Twitter text data if we want to track the prevalence of a word or set of words.

Let's convert the created_at timestamp into a datetime datatype.

In [None]:
# Print created_at to see the original format of datetime in Twitter data
print(df['created_at'].head())

In [None]:
# Convert the created_at column to pandas datetime values
df['created_at'] = pd.to_datetime(df['created_at'])

In [None]:
# Print created_at to see new format
print(df['created_at'].head())

In [None]:
# Set the index of df to created_at
df = df.set_index('created_at')
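
As a small illustration of the conversion steps above, the sketch below parses two made-up timestamps. It assumes the strings follow the classic Twitter API layout (for example `Mon May 06 20:01:29 +0000 2013`); the actual format in the dataset should be checked against the printed `created_at` values first.

```python
import pandas as pd

# Two made-up rows; the timestamp layout is an assumption
raw = pd.DataFrame({
    "created_at": [
        "Tue May 07 08:15:00 +0000 2013",
        "Mon May 06 20:01:29 +0000 2013",
    ],
    "class": [0, 1],
})

# An explicit format avoids guessing; %z reads the UTC offset
raw["created_at"] = pd.to_datetime(
    raw["created_at"], format="%a %b %d %H:%M:%S %z %Y"
)

# A sorted DatetimeIndex is what resample() and date slicing expect
ts = raw.set_index("created_at").sort_index()

print(ts.index)
```

Sorting after `set_index` matters here because the rows were shuffled earlier in the notebook.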

In [None]:
df.describe()

## Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an iterative process of exploring the data and summarizing its characteristics through statistics and visualizations. The purpose of EDA is to gain an understanding of the data by identifying trends, anomalies, or relationships that might be helpful when making decisions in the modeling process.

In [None]:
# Bot vs human tweets 
counts = df['class'].value_counts()
human = counts[0]
bot = counts[1]
human_per = (human/(human + bot))*100
bot_per = (bot/(human + bot))*100
print('There are {} tweets made by humans({:.3f}%) and {} tweets made by bots ({:.3f}%) in this dataset.'.format(human, human_per, bot, bot_per))

In [None]:
# Plot target variable
plt.figure(figsize=(12,6))
g = sns.countplot(x = 'class', data = df)
g.set_title('Count of Tweets made by Humans vs Bots', fontsize = 17)
g.set_xlabel('User Type', fontsize = 15)
g.set_ylabel('Tweets', fontsize = 15)

for p in g.patches:
    height = p.get_height()
    g.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/len(df) * 100),
            ha="center", fontsize=15) 

In [None]:
account = df.groupby(by='user_id', as_index=False).agg({'class':pd.Series.nunique})
account.describe()



In [None]:
# Total user accounts
# Take one class label per account, then count the accounts in each class
account = df.groupby(by='user_id')['class'].first().value_counts()
human = account[0]
bot = account[1]
human_per = (human/(human + bot))*100
bot_per = (bot/(human + bot))*100
print('There are {} total accounts: {} are human({:.3f}%) and {} are bots({:.3f}%).'.format(human + bot, human, human_per, bot, bot_per))

In [None]:
# Create a python column
ds_tweets['python'] = check_word_in_tweet('#python', ds_tweets)

# Create an rstats column
ds_tweets['rstats'] = check_word_in_tweet('#rstats', ds_tweets)
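
The cell above calls `check_word_in_tweet` on `ds_tweets`, neither of which is defined in this notebook. The sketch below shows one way such a helper could look. The function body, the `text_col` parameter, and the toy `ds_tweets` frame are all assumptions made for illustration, not the original implementation.

```python
import pandas as pd

def check_word_in_tweet(word, data, text_col="text"):
    """Return a boolean Series: True where the tweet text contains `word`.

    Hypothetical helper: only the name comes from the notebook cell;
    the body and the `text_col` parameter are assumptions.
    """
    # regex=False treats `word` literally; na=False maps missing text to False
    return data[text_col].str.contains(word, case=False, regex=False, na=False)

# Made-up tweets standing in for the undefined ds_tweets frame
ds_tweets = pd.DataFrame({
    "text": ["Learning #Python today", "I love #rstats", None, "no hashtags here"],
})

ds_tweets["python"] = check_word_in_tweet("#python", ds_tweets)
ds_tweets["rstats"] = check_word_in_tweet("#rstats", ds_tweets)

print(ds_tweets)
```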

In [None]:
# Daily mean of the class label (1 = bot), i.e. the share of each day's tweets made by bots
bt_per_day = df['class'].resample('1D').mean()

# Share of each day's tweets made by genuine users
#gt_per_day = (df['class'] == 0).resample('1D').mean()

# Plot the daily share against the day of the month
plt.plot(bt_per_day.index.day, bt_per_day, color='green')
#plt.plot(gt_per_day.index.day, gt_per_day, color='blue')

# Add labels and show
plt.xlabel('Day')
plt.ylabel('Share of tweets')
plt.title('Daily Share of Tweets Made by Bots')
#plt.legend(('Bot', 'Human'))
plt.show()

In [None]:
# tweets per day

In [None]:
# tweet over time 

In [None]:
# pauses between tweets
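
The placeholder cells above could be filled in along the following lines. This is only a sketch on made-up data: it assumes a `created_at` datetime index plus `user_id` and `class` columns, and the names `tweets_per_day` and `median_pause` are introduced here for illustration.

```python
import pandas as pd

# Made-up tweets: account 1 is genuine (0), account 2 is a bot (1)
toy = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2013-05-06 09:00", "2013-05-07 18:30",
        "2013-05-06 10:00", "2013-05-06 10:01", "2013-05-06 10:02",
    ]),
    "user_id": [1, 1, 2, 2, 2],
    "class": [0, 0, 1, 1, 1],
}).set_index("created_at").sort_index()

# Tweets per day: one row per day, one column per class
tweets_per_day = (
    toy.groupby([pd.Grouper(freq="D"), "class"])
       .size()
       .unstack(fill_value=0)
)

# Pauses between consecutive tweets of the same account, in seconds
ordered = toy.reset_index().sort_values(["user_id", "created_at"])
ordered["pause_s"] = (
    ordered.groupby("user_id")["created_at"].diff().dt.total_seconds()
)

# Median pause per class (the first tweet of each account has no pause)
median_pause = ordered.groupby("class")["pause_s"].median()

print(tweets_per_day)
print(median_pause)
```

Computing the pause within each account, rather than across the whole table, keeps tweets from different users from being treated as consecutive.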