# Exam Project: The formation of ISIS' Social Media Network
Group members: Zeyu Zhao, Helge Zille, Edith Zink, Sina Smid

**This Notebook prepares the tweets for the text analysis and contains**

**1. Preprocessing of tweets - data cleaning:**
- Prepare a preprocessed and clean `tweets` column for text analysis

**2. Descriptive: word frequencies**:
- Top 25 words with and without stop-words
- Wordclouds 
- TF-IDF

**3. Tokenization**:
- Method 1: Word split using NLTK
- Method 2: Sentiment analysis
- Method 3: Deepmoji analysis 

In [None]:
import os
import requests
import re
import networkx as nx
from networkx.drawing.nx_agraph import graphviz_layout
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
plt.style.use('ggplot')
import datetime
import wordcloud
from wordcloud import WordCloud
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.externals import joblib
# NLTK 
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import TweetTokenizer
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk import bigrams 
from textblob import TextBlob # pip install -U textblob
# Functions
from our_functions import *

**Data import:** 
- Dataset is downloaded from Kaggle (https://www.kaggle.com/fifthtribe/how-isis-uses-twitter)

In [None]:
data = read_tweets('tweets_1.csv')
data

**1.Preprocessing**: Takes care of the following parts of the tweets:
- References: `@`
- Retweets: `RT`
- Hashtags to topics: `#`
- Links: `http\`

**STEP 1 Preprocessing **

In [3]:
# TO-DO What is about the \n??
# How to reasonable delete non-word characters?

# Create a new coloumn for preprocessed tweets
def extract_from_to_column(data, regex, from_col, to_col):
    data[to_col]=data[from_col].apply(lambda x: " ".join(regex.findall(x)))
    return data

def remove_from_body(data,regex):
    data['tweets_prepr']=data.tweets.apply(lambda x: re.sub(regex,'',x))
    return data

regex1 = re.compile("@(\S+)")   # tagged users
regex2 = re.compile("http\S+")  # urls
regex3 = re.compile("ENGLISH TRANS[^:]*:") # prefix
regex4 = re.compile("#(\S+)") # hashtags
regex5 = re.compile("RT\s") # retweets

data = extract_from_to_column(data, regex1, 'tweets', 'tags')
data = remove_from_body(data, regex1)

data = extract_from_to_column(data, regex2, 'tweets', 'links')
data = remove_from_body(data, regex2)

data = remove_from_body(data, regex3)

data = extract_from_to_column(data, regex4, 'tweets', 'hashtag')
data = remove_from_body(data, regex4)

data = extract_from_to_column(data, regex5, 'tweets', 'retweets')
data = remove_from_body(data, regex5)

data.tags = data.tags.str.split()
data.head(1000)

Unnamed: 0,name,username,description,location,followers,numberstatuses,time,tweets,date,translated,tags,tweets_prepr,links,hashtag,retweets
0,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,2015-01-06 21:07:00,'A MESSAGE TO THE TRUTHFUL IN SYRIA - SHEIKH ...,2015-01-06,True,[],'A MESSAGE TO THE TRUTHFUL IN SYRIA - SHEIKH ...,http://t.co/73xFszsjvr http://t.co/x8BZcscXzq,,
1,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,2015-01-06 21:27:00,SHEIKH FATIH AL JAWLANI 'FOR THE PEOPLE OF IN...,2015-01-06,True,[],SHEIKH FATIH AL JAWLANI 'FOR THE PEOPLE OF IN...,http://t.co/uqqzXGgVTz http://t.co/A7nbjwyHBr,,
2,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,2015-01-06 21:29:00,FIRST AUDIO MEETING WITH SHEIKH FATIH AL JAWL...,2015-01-06,True,[],FIRST AUDIO MEETING WITH SHEIKH FATIH AL JAWL...,http://t.co/TgXT1GdGw7 http://t.co/ZuE8eisze6,,
3,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,2015-01-06 21:37:00,"SHEIKH NASIR AL WUHAYSHI (HA), LEADER OF AQAP...",2015-01-06,True,[],"SHEIKH NASIR AL WUHAYSHI (HA), LEADER OF AQAP...",http://t.co/3qg5dKlIwr http://t.co/7bqk1wJAzC,,
4,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,2015-01-06 21:45:00,AQAP: 'RESPONSE TO SHEIKH BAGHDADIS STATEMENT...,2015-01-06,True,[],AQAP: 'RESPONSE TO SHEIKH BAGHDADIS STATEMENT...,http://t.co/2EYm9EymTe,,
5,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,2015-01-06 21:51:00,THE SECOND CLIP IN A DA'WAH SERIES BY A SOLDIE...,2015-01-06,False,[],THE SECOND CLIP IN A DA'WAH SERIES BY A SOLDIE...,http://t.co/EPaPRlph5W http://t.co/4VUYszairt,,
6,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,2015-01-06 22:04:00,OH MURABIT! : http://t.co/hujLj9KGkG http://t...,2015-01-06,True,[],OH MURABIT! : http://t.co/hujLj9KGkG http://t...,http://t.co/hujLj9KGkG http://t.co/t9IxMtBVGK,,
7,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,2015-01-06 22:06:00,'A COLLECTION OF THE WORDS OF THE U'LAMA REGA...,2015-01-06,True,[],'A COLLECTION OF THE WORDS OF THE U'LAMA REGA...,http://t.co/AJbayWNxDQ http://t.co/mAycbhaUzH,,
8,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,2015-01-06 22:17:00,Aslm Please share our new account after the pr...,2015-01-06,False,"[KhalidMaghrebi, seifulmaslul123, CheerLeadUni...",Aslm Please share our new account after the pr...,,,
9,GunsandCoffee,GunsandCoffee70,ENGLISH TRANSLATIONS: http://t.co/QLdJ0ftews,,640,49,2015-01-10 00:05:00,AQAP STATEMENT REGARDING THE BLESSED RAID IN ...,2015-01-10,True,[],AQAP STATEMENT REGARDING THE BLESSED RAID IN ...,http://t.co/qvErFO25Qj http://t.co/YIcnGMVjiX,,


In [4]:
data['tweets_prepr'].head()

0     'A MESSAGE TO THE TRUTHFUL IN SYRIA - SHEIKH ...
1     SHEIKH FATIH AL JAWLANI 'FOR THE PEOPLE OF IN...
2     FIRST AUDIO MEETING WITH SHEIKH FATIH AL JAWL...
3     SHEIKH NASIR AL WUHAYSHI (HA), LEADER OF AQAP...
4     AQAP: 'RESPONSE TO SHEIKH BAGHDADIS STATEMENT...
Name: tweets_prepr, dtype: object

**STEP 2 Preprocessing **

In [5]:
# Remove punctuations and additional signs in tweets_prepr column
data['tweets_prepr'] = data['tweets_prepr'].str.replace('[^\w\s]','')
data['tweets_prepr'].head()

0     A MESSAGE TO THE TRUTHFUL IN SYRIA  SHEIKH AB...
1     SHEIKH FATIH AL JAWLANI FOR THE PEOPLE OF INT...
2     FIRST AUDIO MEETING WITH SHEIKH FATIH AL JAWL...
3     SHEIKH NASIR AL WUHAYSHI HA LEADER OF AQAP TH...
4     AQAP RESPONSE TO SHEIKH BAGHDADIS STATEMENT A...
Name: tweets_prepr, dtype: object

**STEP 3 Preprocessing **

In [None]:
# Spelling corrections with textblob library
data['tweets'][:5].apply(lambda x: str(TextBlob(x).correct()))
data['tweets'].head()

**STEP 4 Preprocessing **

In [None]:
# Stemm the words
stemming = PorterStemmer()
data['tweets_prepr'] = data['tweets_prepr'].apply(lambda x: ' '.join([stemming.stem(word) for word in data.tweets_prepr]))
data['tweets_prepr'].head()

**STEP 5 Preprocessing **

In [None]:
# Removing stop-words: Create new column in data - without stopwords
stop = stopwords.words('english')

data['tweets_without_stop'] = data['tweets_prepr'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
data['tweets_prepr'].head()

**STEP 6 Preprocessing **

In [None]:
# Remove words less than three letters
tokens = [word for word in tokens if len(word) >= 3]
data['tweets'].head()

**STEP 7 Preprocessing **

In [None]:
# lower capitalization
tokens = [word.lower() for word in tokens]
data['tweets'].head()

**STEP 8 Preprocessing **

In [None]:
# lemmatize
lmtzr = WordNetLemmatizer()
tokens = [lmtzr.lemmatize(word) for word in tokens]
preprocessed_text= ' '.join(tokens)
data['tweets'].head()

**2. Descriptive: Word Frequencies**

**Word frequencies **

In [None]:
# get most common words (original column)
all_words = []
for line in list(data['tweets']):
    words = line.split()
    for word in words:
        all_words.append(word.lower())
    
    
Counter(all_words).most_common(25)

In [None]:
# plot word frequency distribution of first few words
plt.figure(figsize=(12,5))
plt.title('Top 25 most common words')
plt.xticks(fontsize=13, rotation=90)
fd = nltk.FreqDist(all_words)
fd.plot(25,cumulative=False)
# log-log plot
word_counts = sorted(Counter(all_words).values(), reverse=True)
plt.figure(figsize=(12,5))
plt.loglog(word_counts, linestyle='-', linewidth=1.5)
plt.ylabel("Freq")
plt.xlabel("Word Rank")
plt.title('log-log plot of words frequency')

**Note**:
- The word distribution present in this data dictionary is a very common phenomenon in large samples of words as shown by Zipf’s law where the most frequent word will occur about twice as often as the second most frequent word, three times as often as the third most frequent word, etc. (see: https://towardsdatascience.com/the-real-world-as-seen-on-twitter-sentiment-analysis-part-one-5ac2d06b63fb)

In [None]:
# get most common words (preprocessed tweet column)
all_words = []
for line in list(data['tweets_prepr']):
    words = line.split()
    for word in words:
        all_words.append(word.lower())
    
    
Counter(all_words).most_common(25)

In [None]:
# plot word frequency distribution of first few words after preprocessing
plt.figure(figsize=(12,5))
plt.title('Top 25 most common words')
plt.xticks(fontsize=13, rotation=90)
fd = nltk.FreqDist(all_words)
fd.plot(25,cumulative=False)

**Wordclouds **

In [None]:
all_words = []
for line in data['tweets']: 
    all_words.extend(line)
    
# create a word frequency dictionary
wordfreq = Counter(all_words)
# draw a Word Cloud with word frequencies
wordcloud = WordCloud(width=900,
                      height=500,
                      max_words=500,
                      max_font_size=100,
                      relative_scaling=0.5,
                      colormap='Blues',
                      normalize_plurals=True).generate_from_frequencies(wordfreq)
plt.figure(figsize=(17,14))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Wordcloud of tweets without removing stopwords")
plt.show()

In [None]:
all_words = []
for line in data['tweets_prepr']: 
    all_words.extend(line)
    
# create a word frequency dictionary
wordfreq = Counter(all_words)
# draw a Word Cloud with word frequencies
wordcloud = WordCloud(width=900,
                      height=500,
                      max_words=500,
                      max_font_size=100,
                      relative_scaling=0.5,
                      colormap='Blues',
                      normalize_plurals=True).generate_from_frequencies(wordfreq)
plt.figure(figsize=(17,14))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Wordcloud of tweets PREPROCESSED")
plt.show()

In [None]:
# remove junk from tweets
junk = re.compile("al|RT|\n|&.*?;|http[s](?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+)*")
tweets = [junk.sub(" ", t) for t in data.tweets]

# remove stop words from tweets
vec = TfidfVectorizer(stop_words='english', ngram_range=(1,2), max_df=.5)
tfv = vec.fit_transform(tweets)

terms = vec.get_feature_names()
wc = WordCloud(height=1000, width=1000, max_words=1000).generate(" ".join(terms))

plt.figure(figsize=(10, 10))
plt.imshow(wc)
plt.axis("off")
plt.title("Wordcloud of tweets AFTER removing stopwords")
plt.show()

**3. Tokenization**:
- word split
- bigrams
- term co-occurence
- sentiment analysis
- Deepmoji

**Word split **

In [None]:
# Version 1.1 Tokenization Split: Tokenize sentences into words: split into words
data['tweets_tok1'] = data.tweets.str.strip().str.split('[\W_]+')
data['tweets'].head()

In [None]:
# Version 1.2 Tokenization NLTK1: Tokenize sentences into words: split into words
tweet_tokenizer = nltk.tokenize.casual.TweetTokenizer()
data['tweets_tok2'] = data['tweets_prepr'].apply(tweet_tokenizer.tokenize)

tweet_tokenizer = nltk.tokenize.casual.TweetTokenizer()

In [None]:
# Version 1.2 Tokenization NLTK2: Tokenize sentences into words: split into words
tweet_tokenizer = nltk.tokenize.casual.TweetTokenizer()
data['tweets_tok3'] = data['tweets_prepr'].apply(tweet_tokenizer.tokenize)

**Bigrams **

**Term co-occurence **