# Twitter Bot Detection

## Integrating machine learning to detect bots

Over the past ten-plus years, Twitter has rapidly evolved into a major communication hub. "Its primary purpose is to connect people and allow people to share their thoughts with a big audience. Twitter can also be a very helpful platform for growing a following and providing your audience with valuable content before they even become customers" ([Hubspot](https://blog.hubspot.com/marketing/what-is-twitter)). However, not all accounts belong to genuine users. According to a Twitter SEC filing in 2017, Twitter estimated that 8.5% of all users were bots. Identifying spam bots helps validate the credibility of communication exchanged on the platform and improves users' experience on Twitter.

In this project, we will use the [Cresci-2017](https://botometer.iuni.iu.edu/bot-repository/datasets.html) bot repository datasets to detect bot accounts. We'll begin by exploring how traits differ between genuine and bot accounts. Then, we will employ supervised learning models (Logistic Regression, Random Forest, Stochastic Gradient Boosting) to create a Twitter bot classifier. Finally, we'll use clustering to identify traits among genuine and spam bot accounts.

 
### Overview of Data
There are a total of 5 files: 
 * 1 example submission file
 * 2 transaction files (test and train)
 * 2 identity files (test and train) 
 
 We will merge the train transaction and train identity tables to gain more information for detecting fraud. To keep things simple, we will only use the training sets. Below is a [description](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203) of the attributes in each table.
 
__Transaction Table__

- TransactionDT: timedelta from a given reference datetime (not an actual timestamp)
- TransactionAMT: transaction payment amount in USD
- ProductCD: product code, the product for each transaction
- card1 - card6: payment card information, such as card type, card category, issue bank, country, etc.
- addr: address
- dist: distance
- P_emaildomain and R_emaildomain: purchaser and recipient email domain
- C1-C14: counting, such as how many addresses are found to be associated with the payment card, etc. The actual meaning is masked.
- D1-D15: timedelta, such as days between previous transaction, etc.
- M1-M9: match, such as names on card and address, etc.
- Vxxx: Vesta engineered rich features, including ranking, counting, and other entity relations.

Categorical Features:
- ProductCD
- card1 - card6
- addr1, addr2
- P_emaildomain, R_emaildomain
- M1 - M9

__Identity Table__

Variables in this table are identity information – network connection information (IP, ISP, Proxy, etc) and digital signature (UA/browser/os/version, etc) associated with transactions.
They're collected by Vesta’s fraud protection system and digital security partners.
(The field names are masked and pairwise dictionary will not be provided for privacy protection and contract agreement)

Categorical Features:
- DeviceType
- DeviceInfo
- id_12 - id_38



In [1]:
# Numpy and pandas
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno

# Statistics tools
import scipy.stats as stats

# Sklearn data clean
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Model selection
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Logistic Regression
from sklearn.linear_model import Lasso, LogisticRegression

# KNN Classifier
from sklearn.neighbors import KNeighborsClassifier

# Decision Trees
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from IPython.display import Image
import pydotplus
import graphviz

# Random Forests 
from sklearn.ensemble import RandomForestClassifier

# SVM
from sklearn.svm import SVC

# Gradient Boost
from xgboost import XGBClassifier

# Evaluate
from sklearn import metrics
from sklearn.metrics import log_loss,accuracy_score, f1_score,roc_auc_score, confusion_matrix, classification_report

# Hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

# Datetime
from datetime import datetime

# Warnings
import warnings

In [2]:
fake_tweets = pd.read_csv('/Users/tsawaengsri/Desktop/Data Science Courses/Datasets/cresci-2017.csv/datasets_full.csv/fake_followers.csv/tweets.csv')
fake_tweets.head()


Unnamed: 0,created_at,id,text,source,user_id,truncated,in_reply_to_status_id,in_reply_to_user_id,in_reply_to_screen_name,retweeted_status_id,...,retweet_count,reply_count,favorite_count,favorited,retweeted,possibly_sensitive,num_hashtags,num_urls,num_mentions,timestamp
0,Sat Apr 20 13:19:19 +0000 2013,325599560959393793,https://t.co/iocNIgHxXH. @LovesOfaLDNgirl her...,"<a href=""http://twitter.com/download/iphone"" r...",10935572,,0,0,,,...,0,0,0,,,,0,1,1,2013-04-20 15:19:19
1,Tue Apr 16 19:31:39 +0000 2013,324243711443730434,Well done hubby @Allan_76 http://t.co/AaeTwLucUG,"<a href=""http://instagram.com"" rel=""nofollow"">...",10935572,,0,0,,,...,0,0,0,,,,0,1,1,2013-04-16 21:31:39
2,Tue Apr 16 17:38:06 +0000 2013,324215137055670274,Two years with my lovely husband - thank you f...,"<a href=""http://instagram.com"" rel=""nofollow"">...",10935572,,0,0,,,...,0,0,0,,,,0,1,1,2013-04-16 19:38:06
3,Sun Apr 14 15:33:00 +0000 2013,323458877003792386,Sorry bunny about your ears but I was hungry.....,"<a href=""http://instagram.com"" rel=""nofollow"">...",10935572,,0,0,,,...,0,0,0,,,,0,1,0,2013-04-14 17:33:00
4,Fri Apr 12 15:37:59 +0000 2013,322735354148945920,"Small man, big drink @Allan_76 http://t.co/4NU...","<a href=""http://instagram.com"" rel=""nofollow"">...",10935572,,0,0,,,...,0,1,0,,,,0,1,1,2013-04-12 17:37:59


In [3]:
genuine_tweets = pd.read_csv('/Users/tsawaengsri/Desktop/Data Science Courses/Datasets/cresci-2017.csv/datasets_full.csv/genuine_accounts.csv/tweets.csv')
genuine_tweets.head()


Unnamed: 0,id,text,source,user_id,truncated,in_reply_to_status_id,in_reply_to_user_id,in_reply_to_screen_name,retweeted_status_id,geo,...,favorited,retweeted,possibly_sensitive,num_hashtags,num_urls,num_mentions,created_at,timestamp,crawled_at,updated
0,593932392663912449,RT @morningJewshow: Speaking about Jews and co...,"<a href=""http://tapbots.com/tweetbot"" rel=""nof...",678033.0,,0.0,0.0,,5.939322e+17,,...,,,,0.0,0.0,1.0,Fri May 01 00:18:11 +0000 2015,2015-05-01 02:18:11,2015-05-01 12:57:19,2015-05-01 12:57:19
1,593895316719423488,This age/face recognition thing..no reason pla...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",678033.0,,0.0,0.0,,0.0,,...,,,,0.0,0.0,0.0,Thu Apr 30 21:50:52 +0000 2015,2015-04-30 23:50:52,2015-05-01 12:57:19,2015-05-01 12:57:19
2,593880638069018624,Only upside of the moment I can think of is th...,"<a href=""http://twitter.com"" rel=""nofollow"">Tw...",678033.0,,0.0,0.0,,0.0,,...,,,,2.0,0.0,0.0,Thu Apr 30 20:52:32 +0000 2015,2015-04-30 22:52:32,2015-05-01 12:57:19,2015-05-01 12:57:19
3,593847955536252928,If you're going to think about+create experien...,"<a href=""http://tapbots.com/tweetbot"" rel=""nof...",678033.0,,0.0,0.0,,0.0,,...,,,,2.0,0.0,0.0,Thu Apr 30 18:42:40 +0000 2015,2015-04-30 20:42:40,2015-05-01 12:57:19,2015-05-01 12:57:19
4,593847687847350272,Watching a thread on FB about possible future ...,"<a href=""http://tapbots.com/tweetbot"" rel=""nof...",678033.0,,0.0,0.0,,0.0,,...,,,,0.0,0.0,0.0,Thu Apr 30 18:41:36 +0000 2015,2015-04-30 20:41:36,2015-05-01 12:57:19,2015-05-01 12:57:19


In [4]:
# Keep only the tweet text from the fake-follower tweets
fake_text = pd.DataFrame(fake_tweets['text'])
fake_text.head()

Unnamed: 0,text
0,https://t.co/iocNIgHxXH. @LovesOfaLDNgirl her...
1,Well done hubby @Allan_76 http://t.co/AaeTwLucUG
2,Two years with my lovely husband - thank you f...
3,Sorry bunny about your ears but I was hungry.....
4,"Small man, big drink @Allan_76 http://t.co/4NU..."


In [5]:
# Keep only the tweet text from the genuine-account tweets
real_text = pd.DataFrame(genuine_tweets['text'])
real_text.head()

Unnamed: 0,text
0,RT @morningJewshow: Speaking about Jews and co...
1,This age/face recognition thing..no reason pla...
2,Only upside of the moment I can think of is th...
3,If you're going to think about+create experien...
4,Watching a thread on FB about possible future ...


In [None]:
def check_word_in_tweet(word, data):
    """Checks if a word is in a Twitter dataset's text. 
    Checks text and extended tweet (140+ character tweets) for tweets,
    retweets and quoted tweets.
    Returns a logical pandas Series.
    """
    # Assumes the flattened extended/quoted columns below exist in `data`.
    # na=False treats missing text as "no match" so the mask stays boolean.
    contains_column = data['text'].str.contains(word, case=False, na=False)
    contains_column |= data['extended_tweet-full_text'].str.contains(word, case=False, na=False)
    contains_column |= data['quoted_status-text'].str.contains(word, case=False, na=False)
    contains_column |= data['quoted_status-extended_tweet-full_text'].str.contains(word, case=False, na=False)
    return contains_column
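
A minimal usage sketch of the same idea on a toy DataFrame. The frame, its contents, and the helper name `flag_word` are hypothetical; only a `text` column is assumed, since the extended and quoted-tweet columns referenced above may not be present in every dataset. Passing `na=False` treats missing text as a non-match instead of propagating NaN.

```python
import pandas as pd

def flag_word(word, data, columns=("text",)):
    """Return a boolean Series marking rows where any listed column contains `word`."""
    mask = pd.Series(False, index=data.index)
    for col in columns:
        if col in data.columns:  # skip columns this dataset does not have
            mask |= data[col].str.contains(word, case=False, na=False, regex=False)
    return mask

# Hypothetical toy data; the None entry stands in for a tweet with missing text
toy = pd.DataFrame(
    {"text": ["Well done hubby", None, "well, that was fun", "Small man, big drink"]}
)
mask = flag_word("well", toy)
print(mask.tolist())
```

Looping over a tuple of column names keeps the helper usable when a dataset only carries the plain `text` field.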

In [8]:
import nltk
nltk.downloader.download('vader_lexicon')
# Load SentimentIntensityAnalyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Instantiate new SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()

# Generate sentiment scores
# A few tweets have missing text, which pandas stores as float NaN and VADER
# cannot process, so drop those rows and make sure every value is a string
fake_sen_scores = fake_text['text'].dropna().astype(str).apply(sid.polarity_scores)

real_sen_scores = real_text['text'].dropna().astype(str).apply(sid.polarity_scores)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/tsawaengsri/nltk_data...
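
`polarity_scores` returns one dictionary per tweet with the keys `neg`, `neu`, `pos`, and `compound`, so a Series of scores is usually easier to work with once it is expanded into columns. A minimal sketch, using hand-written dictionaries in place of real VADER output (the values and index labels below are made up for illustration):

```python
import pandas as pd

# Stand-in for a Series produced by .apply(sid.polarity_scores); values are illustrative only
scores = pd.Series(
    [
        {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6},
        {"neg": 0.4, "neu": 0.6, "pos": 0.0, "compound": -0.5},
    ],
    index=[10, 11],
)

# One column per score type, keeping the original tweet index
score_df = pd.DataFrame(scores.tolist(), index=scores.index)
print(score_df[["compound"]])
```

Keeping the original index makes it possible to join the scores back onto the tweet table later.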

In [None]:
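# Compound sentiment per tweet, indexed by the tweet's timestamp
fake_sentiment = fake_sen_scores.apply(lambda s: s['compound'])
fake_sentiment.index = pd.to_datetime(fake_tweets.loc[fake_sentiment.index, 'timestamp'].to_numpy(), errors='coerce')
real_sentiment = real_sen_scores.apply(lambda s: s['compound'])
real_sentiment.index = pd.to_datetime(genuine_tweets.loc[real_sentiment.index, 'timestamp'].to_numpy(), errors='coerce')

# Plot average fake-follower sentiment per day of month
fake_daily = fake_sentiment.groupby(fake_sentiment.index.day).mean()
plt.plot(fake_daily.index, fake_daily, color = 'green')

# Plot average genuine-account sentiment per day of month
real_daily = real_sentiment.groupby(real_sentiment.index.day).mean()
plt.plot(real_daily.index, real_daily, color = 'blue')

plt.xlabel('Day')
plt.ylabel('Sentiment')
plt.title('Sentiment of fake and genuine tweets')
plt.legend(('fake', 'genuine'))
plt.show()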

In [9]:
fake_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 196027 entries, 0 to 196026
Data columns (total 23 columns):
created_at                 196027 non-null object
id                         196027 non-null int64
text                       196007 non-null object
source                     196027 non-null object
user_id                    196027 non-null int64
truncated                  0 non-null float64
in_reply_to_status_id      196027 non-null int64
in_reply_to_user_id        196027 non-null int64
in_reply_to_screen_name    27159 non-null object
retweeted_status_id        0 non-null float64
geo                        0 non-null float64
place                      1683 non-null object
contributors               0 non-null float64
retweet_count              196027 non-null int64
reply_count                196027 non-null int64
favorite_count             196027 non-null int64
favorited                  0 non-null float64
retweeted                  0 non-null float64
possibly_sensitive     

In [12]:
# Number of unique accounts in the fake-follower tweets
fake_tweets['user_id'].nunique()

3202

In [13]:
fake_tweets['timestamp'].describe()

count                  196027
unique                 193671
top       2012-05-07 00:47:18
freq                       11
Name: timestamp, dtype: object

In [None]:
# tweet over time 
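
One possible way to fill in this step, assuming the `timestamp` column holds strings in the `YYYY-MM-DD HH:MM:SS` form shown in the outputs above. The toy frame below is hypothetical and stands in for `fake_tweets`:

```python
import pandas as pd

# Hypothetical stand-in for the tweet table
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 2],
    "timestamp": [
        "2013-04-12 17:37:59",
        "2013-04-14 17:33:00",
        "2013-04-14 09:00:00",
        "2013-04-16 19:38:06",
        "2013-04-16 21:31:39",
    ],
})

# Parse strings to datetimes; unparseable values become NaT instead of raising
toy["timestamp"] = pd.to_datetime(toy["timestamp"], errors="coerce")

# Count tweets on each calendar day (days without tweets show up as 0)
tweets_per_day = (
    toy.dropna(subset=["timestamp"])
       .set_index("timestamp")
       .sort_index()
       .resample("D")
       .size()
)
print(tweets_per_day)
```

The resulting Series can be passed straight to `plt.plot` to draw tweet volume over time.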

In [None]:
# pauses between tweets
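
A sketch of how the pause between consecutive tweets could be computed per account, assuming timestamps have already been parsed to datetimes. The toy data and the `pause_seconds` column name are hypothetical:

```python
import pandas as pd

# Hypothetical stand-in for the tweet table
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2013-04-12 17:00:00",
        "2013-04-12 17:00:30",
        "2013-04-12 18:00:30",
        "2013-04-14 09:00:00",
        "2013-04-16 09:00:00",
    ]),
})

# Sort within each account, then take the difference between consecutive tweets
toy = toy.sort_values(["user_id", "timestamp"])
toy["pause_seconds"] = toy.groupby("user_id")["timestamp"].diff().dt.total_seconds()

# Median pause per account; the first tweet of each account has no pause (NaN) and is skipped
median_pause = toy.groupby("user_id")["pause_seconds"].median()
print(median_pause)
```

The median is used here rather than the mean because a single long gap would otherwise dominate the summary; that choice is an assumption, not something the notebook specifies.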

# Exploratory Data Analysis 
Exploratory Data Analysis (EDA) is an iterative process for exploring the data and summarizing its characteristics by calculating statistics or using visualization methods. The purpose of EDA is to gain an understanding of the data by identifying trends, anomalies, or relationships that might be helpful when making decisions in the modeling process.
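
As a small illustration of the kind of summary EDA produces, the sketch below compares per-tweet feature averages between the two account types. The toy values and the `label` column are hypothetical; `num_urls` and `num_mentions` are columns that appear in the tweet tables above.

```python
import pandas as pd

# Hypothetical labelled sample combining both account types
toy = pd.DataFrame({
    "label": ["fake", "fake", "genuine", "genuine"],
    "num_urls": [1, 1, 0, 0],
    "num_mentions": [1, 0, 1, 0],
})

# Average of each feature within each account type
summary = toy.groupby("label")[["num_urls", "num_mentions"]].mean()
print(summary)
```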

In [16]:
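# NOTE: a `.z01` file is one segment of a split zip archive, not a CSV,
# so the archive needs to be extracted before read_csv can parse its contents
tweet = pd.read_csv('/Users/tsawaengsri/Desktop/twitter_datasets.z01')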
tweet.head()

ParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
