<a id='Top'></a>
<center>
<h1><u>NLP with disaster Tweets - True or Fake?</u></h1>
<h3>Preprocessing, XGBoost model and ensemble</h3>
</center>

---


<!-- Start of Unsplash Embed Code - Centered (Embed code by @BirdyOz)-->
<div style="width:60%; margin: 20px 20% !important;">
    <img src="https://images.unsplash.com/photo-1578652229330-05f320786aa9?ixlib=rb-1.2.1&amp;q=80&amp;fm=jpg&amp;crop=entropy&amp;cs=tinysrgb&amp;w=720&amp;fit=max&amp;ixid=eyJhcHBfaWQiOjEyMDd9" class="img-responsive img-fluid img-med" alt="reflection of light on body of water at night " title="reflection of light on body of water at night ">
    <div class="text-muted" style="opacity: 0.5">
        <small><a href="https://unsplash.com/photos/nVYEechGqqM" target="_blank">Photo</a> by <a href="https://unsplash.com/@sippakorn" target="_blank">@sippakorn</a> on <a href="https://unsplash.com" target="_blank">Unsplash</a>, accessed 24/09/2020</small>
    </div>
</div>
<!-- End of Unsplash Embed code -->

### INTRODUCTION

The problem of fake news and disinformation was known long before the advent of the Internet. It was used to mislead the enemy or to gain an advantage over competitors in politics or the economy.

**Sun Tzu** wrote in "The Art of War":   
**“The whole secret lies in confusing the enemy, so that he cannot fathom our real intent”**.  
It was written in the 5th century BC, but the principle remains valid today. The only difference is that the Internet is now the new battlefield.

The easy access to and popularity of social media networks like Facebook or Twitter create a great opportunity for disseminating fake news. Of course, this also applies to blogs and web pages. Many fake news stories can then sneak into mainstream news distribution channels like TV or the press. This happens more and more frequently; you can read many interesting examples on the [Fighting Fake](http://www.fightingfake.org.uk/fake-news#ExamplesOfFakeNews-3) site. The problem has become so widespread that it has reached cinema as well: I recommend watching, for example, “[The Hater](https://www.imdb.com/title/tt9506474/)” (2020), available on Netflix.

The uncontrolled spread of fake news poses a real threat not only to individual politicians and institutions but to society as a whole. Confusing the targeted nation or group of people may be intended to divide them so that they cannot unite. According to the old rule *"divide and conquer"*, this is a remarkably good move for the attacker.

That is why it is of the utmost importance to protect ourselves from this danger. There are many ways to achieve that, from basics like problem awareness and news verification skills to advanced analytics backend systems.

So here we are: a Twitter false news dataset on which we can test our skills and learn how to separate real from fake news. This is a standard supervised classification task:  
* **Supervised** - the labels are provided and included in a training dataset.
* **Classification** - the labels are of binary type, 1 (true) and 0 (false).  

We will be working with text so this task requires NLP - "Natural Language Processing".

Note that the aim of this notebook is not to get the highest score in the competition, as we will be working with "classic" scikit-learn machine learning algorithms. You can obtain much better results in this case with a deep learning approach using GloVe, BERT, etc.


### SECTIONS:  
1. [Reading Data](#Reading)<br>
2. [Exploratory Data Analysis](#EDA)<br>
3. [Tokenization and Features Engineering](#Tokenization)<br>
4. [Baseline models](#XGB)<br>
    4.1 XGB  
    4.2 AdaBoost  
    4.3 Random Forest Classifier  
    4.4 Extra Tree Classifier  
5. [Ensemble](#Ensemble)<br>

<a id='Reading'></a>
## 1. Reading data <a href='#Top' style="text-decoration: none;">^</a><br>  

The dataset is available as a comma-separated values (.csv) file and can be easily read with Python's pandas library. For visualisation we will use the matplotlib, seaborn and wordcloud libraries.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # basic visualisation library
import seaborn as sns # advanced and nice visualisations
import warnings # library to manage warnings
from wordcloud import WordCloud, STOPWORDS # library to create a wordcloud

warnings.simplefilter(action='ignore', category=FutureWarning) # silencing FutureWarnings out
warnings.simplefilter(action='ignore', category=UserWarning)

# printing out version of libraries we use
print("numpy version: {}".format(np.__version__))
print("pandas version: {}".format(pd.__version__))
print("seaborn version: {}".format(sns.__version__))

Reading train and test datasets.

In [None]:
train_data = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test_data = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')

print('There are {} rows and {} columns in a training dataset'.format(train_data.shape[0],train_data.shape[1]))
print('and {} rows and {} columns in a testing dataset.'.format(test_data.shape[0],test_data.shape[1]))

<a id='EDA'></a>
## 2. Exploratory Data Analysis <a href='#Top' style="text-decoration: none;">^</a><br>  

The first and crucial part of any data science project is to understand the data we are working with. Therefore, in this section we will perform a basic Exploratory Data Analysis, checking (among other things) the value ranges, types and distributions of the variables.

In [None]:
train_data.head()

Let's read some tweets.

In [None]:
for i in np.arange(0,10,1):
    print(i, train_data.text[i])

Checking the distribution of the (binary) target variable.

In [None]:
target_counts = train_data['target'].value_counts().div(len(train_data)).mul(100) # calculating percentages of target values

ax = sns.barplot(x=target_counts.index, y=target_counts.values) # keyword arguments: newer seaborn versions reject positional x/y
ax.set_xlabel('Target variable')
ax.set_ylabel('Percentage [%]')
ax.set_xticklabels(['False','True'])

ax.set_title('Training dataset', fontsize=13)
plt.show()

Our target variable is binary and not well balanced but for now, just for simplicity, we will leave it as it is. Alternatively, we can upsample/downsample target variable groups.  
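
As a rough illustration of the downsampling option mentioned above, the sketch below balances a small toy frame by drawing the same number of rows from each class. The frame `toy` and its values are made up for the example; on the real data the same steps would be applied to `train_data`.

```python
import pandas as pd

# Toy frame standing in for train_data (hypothetical values)
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f", "g"],
    "target": [0, 0, 0, 0, 1, 1, 1],
})

# Size of the smallest class
n_min = toy["target"].value_counts().min()

# Downsample: draw n_min rows from every class and stack the samples
balanced = pd.concat(
    [group.sample(n=n_min, random_state=42) for _, group in toy.groupby("target")]
).reset_index(drop=True)

print(balanced["target"].value_counts())
```

Downsampling throws away rows from the larger class, so with a dataset of this size it trades some information for balance; upsampling (sampling the smaller class with `replace=True`) is the usual alternative.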

Let's check how much data is missing in remaining columns.

In [None]:
missing_cols = ['keyword', 'location']

fig, ax = plt.subplots(1,2, figsize=(12, 5))
# calculating percent of missing data in each column
train_nans = train_data[missing_cols].isnull().sum()/len(train_data)*100  
test_nans = test_data[missing_cols].isnull().sum()/len(test_data)*100
# creating a barplot
sns.barplot(x=train_nans.index, y=train_nans.values, ax=ax[0])  
sns.barplot(x=test_nans.index, y=test_nans.values, ax=ax[1])

ax[0].set_ylabel('Missing Values [%]', size=15, labelpad=20)
ax[0].set_yticks(np.arange(0,40,5))
ax[0].set_ylim((0,35))

ax[0].set_title('Training dataset', fontsize=13)
ax[1].set_title('Testing dataset', fontsize=13)
plt.show()

The barplot below shows the true ratio (the share of true tweets among all tweets with a given keyword) for the TOP 30 most *true* keywords.

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
true_ratios = train_data.groupby('keyword')['target'].mean().sort_values(ascending=False).mul(100)
sns.barplot(x=true_ratios.index[:30], y=true_ratios.values[:30], ax=ax)
plt.xticks(rotation=90)
plt.title("TOP 30 most 'true' keywords")
plt.ylabel("True ratio [%]")
plt.show()

Apparently, the keywords **derailment**, **wreckage** and **debris** are strongly associated with true tweets.  

The same for the TOP 30 most *fake* keywords.

In [None]:
fig, ax = plt.subplots(figsize=(12,6))
true_ratios = train_data.groupby('keyword')['target'].mean().sort_values(ascending=True).mul(100)
sns.barplot(x=true_ratios.index[:30], y=true_ratios.values[:30], ax=ax)
plt.xticks(rotation=90)
plt.title("TOP 30 most 'fake' keywords")
plt.ylabel("True ratio [%]")
plt.show()

In the training set, the keyword *aftershock* always indicates a fake tweet.
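
A ratio of 0% or 100% is only as convincing as the number of tweets behind it. The sketch below, on made-up toy data, reports the tweet count next to the true ratio for each keyword; the names `toy` and `stats` are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy data standing in for train_data (hypothetical keywords and labels)
toy = pd.DataFrame({
    "keyword": ["debris", "debris", "debris", "aftershock", "aftershock", "ruin"],
    "target": [1, 1, 1, 0, 0, 1],
})

# Mean of a 0/1 target is the share of true tweets; count is the sample size
stats = (
    toy.groupby("keyword")["target"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "true_ratio", "count": "n_tweets"})
)
stats["true_ratio"] = stats["true_ratio"] * 100  # convert to percent
stats = stats.sort_values("true_ratio", ascending=False)

print(stats)
```

Here `ruin` reaches 100% on the strength of a single tweet, which is the kind of case the count column makes visible.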

<a id='Tokenization'></a>
## 3. Tokenization and Features Engineering<a href='#Top'>^</a><br>

In this phase we will pre-process the dataset by cleaning it and engineering new, useful features for the ML algorithms.

Text decomposition (tokenisation, stemming and lemmatisation) is the most important operation and can be performed in many ways. Here the popular open-source NLTK library will be used (there are many others, such as spaCy, CoreNLP and gensim). The tokenisation in this notebook relies heavily on regex (regular expressions).  

Overall text preprocessing can include operations like:
* Tokenisation - splitting text into a list of tokens
* Stemming - reduction of words to their roots by stripping suffixes, e.g. "connected", "connecting" and "connection" are all reduced to "connect" (mapping irregular forms such as "did" or "done" to "do" requires lemmatisation instead)
* Stop Words Removal - words like "a", "an", "for", "from" are often insignificant and can be removed to reduce the noise
* Features Extraction
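
To make the first three operations concrete, here is a minimal standard-library sketch. It is not the notebook's NLTK pipeline: the stop-word list is a tiny hand-picked assumption, and `crude_stem` is a toy suffix stripper standing in for a real stemmer such as NLTK's Porter stemmer.

```python
import re

# Small hand-picked stop-word list for illustration only (an assumption;
# libraries such as NLTK ship much longer lists)
STOP_WORDS = {"a", "an", "the", "for", "from", "in", "is", "are", "of", "to"}

def tokenize(text):
    # Lowercase, then keep runs of letters, digits and apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stop_words(tokens):
    # Drop every token that appears in the stop-word list
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    # Toy suffix stripping, NOT a real stemming algorithm
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tweet = "Forest fires are spreading in the hills"
tokens = tokenize(tweet)
filtered = remove_stop_words(tokens)
stems = [crude_stem(t) for t in filtered]

print(tokens)    # ['forest', 'fires', 'are', 'spreading', 'in', 'the', 'hills']
print(filtered)  # ['forest', 'fires', 'spreading', 'hills']
print(stems)     # ['forest', 'fire', 'spread', 'hill']
```

The resulting list of stems is what a feature extraction step (for example word counts) would then consume.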

**Text Cleaning**

Now we will clean the tweets by expanding contractions like "I'm" to "I am", "won't" to "will not", etc. The entire process of cleaning tweets can be very complex and is very well described in [this Kaggle notebook by Gunes Evitan](https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert).

In [None]:
import re
# removing contractions
def decontracted(tweet):
    # specific
    tweet = re.sub(r"won\'t", "will not", tweet)
    tweet = re.sub(r"can\'t", "can not", tweet)
    # general
    tweet = re.sub(r"n\'t", " not", tweet)
    tweet = re.sub(r"\'re", " are", tweet)
    tweet = re.sub(r"\'s", " is", tweet)
    tweet = re.sub(r"\'d", " would", tweet)
    tweet = re.sub(r"\'ll", " will", tweet)
    tweet = re.sub(r"\'t", " not", tweet)
    tweet = re.sub(r"\'ve", " have", tweet)
    tweet = re.sub(r"\'m", " am", tweet)
    return tweet

def special_chars(tweet):
    # Special characters
    tweet = re.sub(r"\x89Û_", "", tweet)
    tweet = re.sub(r"\x89ÛÒ", "", tweet)
    tweet = re.sub(r"\x89ÛÓ", "", tweet)
    tweet = re.sub(r"\x89ÛÏWhen", "When", tweet)
    tweet = re.sub(r"\x89ÛÏ", "", tweet)
    tweet = re.sub(r"China\x89Ûªs", "China's", tweet)
    tweet = re.sub(r"let\x89Ûªs", "let's", tweet)
    tweet = re.sub(r"\x89Û÷", "", tweet)
    tweet = re.sub(r"\x89Ûª", "", tweet)
    tweet = re.sub(r"\x89Û\x9d", "", tweet)
    tweet = re.sub(r"å_", "", tweet)
    # longer pattern first, otherwise the shorter one makes it unreachable
    tweet = re.sub(r"\x89Û¢åÊ", "", tweet)
    tweet = re.sub(r"\x89Û¢", "", tweet)
    tweet = re.sub(r"fromåÊwounds", "from wounds", tweet)
    tweet = re.sub(r"åÊ", "", tweet)
    tweet = re.sub(r"åÈ", "", tweet)
    tweet = re.sub(r"JapÌ_n", "Japan", tweet)    
    tweet = re.sub(r"Ì©", "e", tweet)
    tweet = re.sub(r"å¨", "", tweet)
    tweet = re.sub(r"SuruÌ¤", "Suruc", tweet)
    tweet = re.sub(r"åÇ", "", tweet)
    tweet = re.sub(r"å£3million", "3 million", tweet)
    tweet = re.sub(r"åÀ", "", tweet)
    return tweet
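
The order of the contraction rules matters: the specific cases ("won't", "can't") have to run before the general "n't" rule. The sketch below uses a condensed copy of the rules (the names `RULES` and `expand` are illustrative, not the notebook's) to show the effect and two limitations.

```python
import re

# Condensed copy of the contraction rules; specific cases come first
RULES = [
    (r"won\'t", "will not"),
    (r"can\'t", "can not"),
    (r"n\'t", " not"),
    (r"\'re", " are"),
    (r"\'s", " is"),
    (r"\'m", " am"),
]

def expand(tweet):
    # Apply every rule in order, like the chained re.sub calls
    for pattern, replacement in RULES:
        tweet = re.sub(pattern, replacement, tweet)
    return tweet

print(expand("I'm sure it won't rain"))  # I am sure it will not rain
print(expand("Tom's car"))               # Tom is car (possessives are expanded too)
print(expand("don’t"))                   # don’t (typographic apostrophe is not matched)
```

If the general "n't" rule ran first, "won't" would become "wo not". The possessive and typographic-apostrophe cases are limitations of this simple rule set rather than something the rules try to handle.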


Below is a long cleaning function taken from the [notebook by Gunes Evitan](https://www.kaggle.com/gunesevitan/nlp-with-disaster-tweets-eda-cleaning-and-bert). It focuses on cleaning this specific dataset: hashtags, usernames, etc. It improves the quality of this static dataset, but it is not a recommended approach for a dynamic real-life dataset. In the cited notebook, however, the approach to the problem is different: it uses word vectors built on specific corpora (e.g. Wikipedia), and there this kind of cleaning definitely helps.

I'll use this cleaning section below (hidden cell).
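
For a dynamic dataset, part of the hand-written list could in principle be replaced by a generic rule. The sketch below is one possible approach, not taken from the cited notebook: it splits CamelCase hashtags and usernames at case boundaries. The function name `split_camel_case` is hypothetical.

```python
import re

def split_camel_case(token):
    # Space between a lowercase letter or digit and a following capital:
    # "IranDeal" -> "Iran Deal"
    token = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", token)
    # Space between a run of capitals and a capitalised word:
    # "NASAHurricane" -> "NASA Hurricane"
    token = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", token)
    return token

for tag in ["IranDeal", "NASAHurricane", "BlackLivesMatter", "socialnews"]:
    print(tag, "->", split_camel_case(tag))
```

All-lowercase tags such as `socialnews` pass through unchanged, and abbreviations or nicknames (e.g. `pmarca`) still need a lookup table, so a rule like this can shorten the manual list but not replace it.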

In [None]:
# this is taken from notebook of Gunes Evitan "NLP with Disaster Tweets - EDA, Cleaning and BERT"
def various(tweet):    
    # Character entity references
    tweet = re.sub(r"&gt;", ">", tweet)
    tweet = re.sub(r"&lt;", "<", tweet)
    tweet = re.sub(r"&amp;", "&", tweet)
    
    # Typos, slang and informal abbreviations
    tweet = re.sub(r"w/e", "whatever", tweet)
    tweet = re.sub(r"w/", "with", tweet)
    tweet = re.sub(r"USAgov", "USA government", tweet)
    tweet = re.sub(r"recentlu", "recently", tweet)
    tweet = re.sub(r"Ph0tos", "Photos", tweet)
    tweet = re.sub(r"amirite", "am I right", tweet)
    tweet = re.sub(r"exp0sed", "exposed", tweet)
    tweet = re.sub(r"<3", "love", tweet)
    tweet = re.sub(r"amageddon", "armageddon", tweet)
    tweet = re.sub(r"Trfc", "Traffic", tweet)
    tweet = re.sub(r"8/5/2015", "2015-08-05", tweet)
    tweet = re.sub(r"WindStorm", "Wind Storm", tweet)
    tweet = re.sub(r"8/6/2015", "2015-08-06", tweet)
    tweet = re.sub(r"10:38PM", "10:38 PM", tweet)
    tweet = re.sub(r"10:30pm", "10:30 PM", tweet)
    tweet = re.sub(r"16yr", "16 year", tweet)
    tweet = re.sub(r"lmao", "laughing my ass off", tweet)   
    tweet = re.sub(r"TRAUMATISED", "traumatized", tweet)
    
    # Hashtags and usernames
    tweet = re.sub(r"IranDeal", "Iran Deal", tweet)
    tweet = re.sub(r"ArianaGrande", "Ariana Grande", tweet)
    tweet = re.sub(r"camilacabello97", "camila cabello", tweet) 
    tweet = re.sub(r"RondaRousey", "Ronda Rousey", tweet)     
    tweet = re.sub(r"MTVHottest", "MTV Hottest", tweet)
    tweet = re.sub(r"TrapMusic", "Trap Music", tweet)
    tweet = re.sub(r"ProphetMuhammad", "Prophet Muhammad", tweet)
    tweet = re.sub(r"PantherAttack", "Panther Attack", tweet)
    tweet = re.sub(r"StrategicPatience", "Strategic Patience", tweet)
    tweet = re.sub(r"socialnews", "social news", tweet)
    tweet = re.sub(r"NASAHurricane", "NASA Hurricane", tweet)
    tweet = re.sub(r"onlinecommunities", "online communities", tweet)
    tweet = re.sub(r"humanconsumption", "human consumption", tweet)
    tweet = re.sub(r"Typhoon-Devastated", "Typhoon Devastated", tweet)
    tweet = re.sub(r"Meat-Loving", "Meat Loving", tweet)
    tweet = re.sub(r"facialabuse", "facial abuse", tweet)
    tweet = re.sub(r"LakeCounty", "Lake County", tweet)
    tweet = re.sub(r"BeingAuthor", "Being Author", tweet)
    tweet = re.sub(r"withheavenly", "with heavenly", tweet)
    tweet = re.sub(r"thankU", "thank you", tweet)
    tweet = re.sub(r"iTunesMusic", "iTunes Music", tweet)
    tweet = re.sub(r"OffensiveContent", "Offensive Content", tweet)
    tweet = re.sub(r"WorstSummerJob", "Worst Summer Job", tweet)
    tweet = re.sub(r"HarryBeCareful", "Harry Be Careful", tweet)
    tweet = re.sub(r"NASASolarSystem", "NASA Solar System", tweet)
    tweet = re.sub(r"animalrescue", "animal rescue", tweet)
    tweet = re.sub(r"KurtSchlichter", "Kurt Schlichter", tweet)
    tweet = re.sub(r"aRmageddon", "armageddon", tweet)
    tweet = re.sub(r"Throwingknifes", "Throwing knives", tweet)
    tweet = re.sub(r"GodsLove", "God's Love", tweet)
    tweet = re.sub(r"bookboost", "book boost", tweet)
    tweet = re.sub(r"ibooklove", "I book love", tweet)
    tweet = re.sub(r"NestleIndia", "Nestle India", tweet)
    tweet = re.sub(r"realDonaldTrump", "Donald Trump", tweet)
    tweet = re.sub(r"DavidVonderhaar", "David Vonderhaar", tweet)
    tweet = re.sub(r"CecilTheLion", "Cecil The Lion", tweet)
    tweet = re.sub(r"weathernetwork", "weather network", tweet)
    tweet = re.sub(r"withBioterrorism&use", "with Bioterrorism & use", tweet)
    tweet = re.sub(r"Hostage&2", "Hostage & 2", tweet)
    tweet = re.sub(r"GOPDebate", "GOP Debate", tweet)
    tweet = re.sub(r"RickPerry", "Rick Perry", tweet)
    tweet = re.sub(r"frontpage", "front page", tweet)
    tweet = re.sub(r"NewsInTweets", "News In Tweets", tweet)
    tweet = re.sub(r"ViralSpell", "Viral Spell", tweet)
    tweet = re.sub(r"til_now", "until now", tweet)
    tweet = re.sub(r"volcanoinRussia", "volcano in Russia", tweet)
    tweet = re.sub(r"ZippedNews", "Zipped News", tweet)
    tweet = re.sub(r"MicheleBachman", "Michele Bachman", tweet)
    tweet = re.sub(r"53inch", "53 inch", tweet)
    tweet = re.sub(r"KerrickTrial", "Kerrick Trial", tweet)
    tweet = re.sub(r"abstorm", "Alberta Storm", tweet)
    tweet = re.sub(r"Beyhive", "Beyonce hive", tweet)
    tweet = re.sub(r"IDFire", "Idaho Fire", tweet)
    tweet = re.sub(r"DETECTADO", "Detected", tweet)
    tweet = re.sub(r"RockyFire", "Rocky Fire", tweet)
    tweet = re.sub(r"Listen/Buy", "Listen / Buy", tweet)
    tweet = re.sub(r"NickCannon", "Nick Cannon", tweet)
    tweet = re.sub(r"FaroeIslands", "Faroe Islands", tweet)
    tweet = re.sub(r"yycstorm", "Calgary Storm", tweet)
    tweet = re.sub(r"IDPs:", "Internally Displaced People :", tweet)
    tweet = re.sub(r"ArtistsUnited", "Artists United", tweet)
    tweet = re.sub(r"ClaytonBryant", "Clayton Bryant", tweet)
    tweet = re.sub(r"jimmyfallon", "jimmy fallon", tweet)
    tweet = re.sub(r"justinbieber", "justin bieber", tweet)  
    tweet = re.sub(r"UTC2015", "UTC 2015", tweet)
    tweet = re.sub(r"Time2015", "Time 2015", tweet)
    tweet = re.sub(r"djicemoon", "dj icemoon", tweet)
    tweet = re.sub(r"LivingSafely", "Living Safely", tweet)
    tweet = re.sub(r"FIFA16", "Fifa 2016", tweet)
    tweet = re.sub(r"thisiswhywecanthavenicethings", "this is why we cannot have nice things", tweet)
    tweet = re.sub(r"bbcnews", "bbc news", tweet)
    tweet = re.sub(r"UndergroundRailraod", "Underground Railroad", tweet)
    tweet = re.sub(r"c4news", "c4 news", tweet)
    tweet = re.sub(r"OBLITERATION", "obliteration", tweet)
    tweet = re.sub(r"MUDSLIDE", "mudslide", tweet)
    tweet = re.sub(r"NoSurrender", "No Surrender", tweet)
    tweet = re.sub(r"NotExplained", "Not Explained", tweet)
    tweet = re.sub(r"greatbritishbakeoff", "great british bake off", tweet)
    tweet = re.sub(r"LondonFire", "London Fire", tweet)
    tweet = re.sub(r"KOTAWeather", "KOTA Weather", tweet)
    tweet = re.sub(r"LuchaUnderground", "Lucha Underground", tweet)
    tweet = re.sub(r"KOIN6News", "KOIN 6 News", tweet)
    tweet = re.sub(r"LiveOnK2", "Live On K2", tweet)
    tweet = re.sub(r"9NewsGoldCoast", "9 News Gold Coast", tweet)
    tweet = re.sub(r"nikeplus", "nike plus", tweet)
    tweet = re.sub(r"david_cameron", "David Cameron", tweet)
    tweet = re.sub(r"peterjukes", "Peter Jukes", tweet)
    tweet = re.sub(r"JamesMelville", "James Melville", tweet)
    tweet = re.sub(r"megynkelly", "Megyn Kelly", tweet)
    tweet = re.sub(r"cnewslive", "C News Live", tweet)
    tweet = re.sub(r"JamaicaObserver", "Jamaica Observer", tweet)
    tweet = re.sub(r"TweetLikeItsSeptember11th2001", "Tweet like it is september 11th 2001", tweet)
    tweet = re.sub(r"cbplawyers", "cbp lawyers", tweet)
    tweet = re.sub(r"fewmoretweets", "few more tweets", tweet)
    tweet = re.sub(r"BlackLivesMatter", "Black Lives Matter", tweet)
    tweet = re.sub(r"cjoyner", "Chris Joyner", tweet)
    tweet = re.sub(r"ENGvAUS", "England vs Australia", tweet)
    tweet = re.sub(r"ScottWalker", "Scott Walker", tweet)
    tweet = re.sub(r"MikeParrActor", "Michael Parr", tweet)
    tweet = re.sub(r"4PlayThursdays", "Foreplay Thursdays", tweet)
    tweet = re.sub(r"TGF2015", "Tontitown Grape Festival", tweet)
    tweet = re.sub(r"realmandyrain", "Mandy Rain", tweet)
    tweet = re.sub(r"GraysonDolan", "Grayson Dolan", tweet)
    tweet = re.sub(r"ApolloBrown", "Apollo Brown", tweet)
    tweet = re.sub(r"saddlebrooke", "Saddlebrooke", tweet)
    tweet = re.sub(r"TontitownGrape", "Tontitown Grape", tweet)
    tweet = re.sub(r"AbbsWinston", "Abbs Winston", tweet)
    tweet = re.sub(r"ShaunKing", "Shaun King", tweet)
    tweet = re.sub(r"MeekMill", "Meek Mill", tweet)
    tweet = re.sub(r"TornadoGiveaway", "Tornado Giveaway", tweet)
    tweet = re.sub(r"GRupdates", "GR updates", tweet)
    tweet = re.sub(r"SouthDowns", "South Downs", tweet)
    tweet = re.sub(r"braininjury", "brain injury", tweet)
    tweet = re.sub(r"auspol", "Australian politics", tweet)
    tweet = re.sub(r"PlannedParenthood", "Planned Parenthood", tweet)
    tweet = re.sub(r"calgaryweather", "Calgary Weather", tweet)
    tweet = re.sub(r"weallheartonedirection", "we all heart one direction", tweet)
    tweet = re.sub(r"edsheeran", "Ed Sheeran", tweet)
    tweet = re.sub(r"TrueHeroes", "True Heroes", tweet)
    tweet = re.sub(r"S3XLEAK", "sex leak", tweet)
    tweet = re.sub(r"ComplexMag", "Complex Magazine", tweet)
    tweet = re.sub(r"TheAdvocateMag", "The Advocate Magazine", tweet)
    tweet = re.sub(r"CityofCalgary", "City of Calgary", tweet)
    tweet = re.sub(r"EbolaOutbreak", "Ebola Outbreak", tweet)
    tweet = re.sub(r"SummerFate", "Summer Fate", tweet)
    tweet = re.sub(r"RAmag", "Royal Academy Magazine", tweet)
    tweet = re.sub(r"offers2go", "offers to go", tweet)
    tweet = re.sub(r"foodscare", "food scare", tweet)
    tweet = re.sub(r"MNPDNashville", "Metropolitan Nashville Police Department", tweet)
    tweet = re.sub(r"TfLBusAlerts", "TfL Bus Alerts", tweet)
    tweet = re.sub(r"GamerGate", "Gamer Gate", tweet)
    tweet = re.sub(r"IHHen", "Humanitarian Relief", tweet)
    tweet = re.sub(r"spinningbot", "spinning bot", tweet)
    tweet = re.sub(r"ModiMinistry", "Modi Ministry", tweet)
    tweet = re.sub(r"TAXIWAYS", "taxi ways", tweet)
    tweet = re.sub(r"Calum5SOS", "Calum Hood", tweet)
    tweet = re.sub(r"po_st", "po.st", tweet)
    tweet = re.sub(r"scoopit", "scoop.it", tweet)
    tweet = re.sub(r"UltimaLucha", "Ultima Lucha", tweet)
    tweet = re.sub(r"JonathanFerrell", "Jonathan Ferrell", tweet)
    tweet = re.sub(r"aria_ahrary", "Aria Ahrary", tweet)
    tweet = re.sub(r"rapidcity", "Rapid City", tweet)
    tweet = re.sub(r"OutBid", "outbid", tweet)
    tweet = re.sub(r"lavenderpoetrycafe", "lavender poetry cafe", tweet)
    tweet = re.sub(r"EudryLantiqua", "Eudry Lantiqua", tweet)
    tweet = re.sub(r"15PM", "15 PM", tweet)
    tweet = re.sub(r"OriginalFunko", "Funko", tweet)
    tweet = re.sub(r"rightwaystan", "Richard Tan", tweet)
    tweet = re.sub(r"CindyNoonan", "Cindy Noonan", tweet)
    tweet = re.sub(r"RT_America", "RT America", tweet)
    tweet = re.sub(r"narendramodi", "Narendra Modi", tweet)
    tweet = re.sub(r"BakeOffFriends", "Bake Off Friends", tweet)
    tweet = re.sub(r"TeamHendrick", "Hendrick Motorsports", tweet)
    tweet = re.sub(r"alexbelloli", "Alex Belloli", tweet)
    tweet = re.sub(r"itsjustinstuart", "Justin Stuart", tweet)
    tweet = re.sub(r"gunsense", "gun sense", tweet)
    tweet = re.sub(r"DebateQuestionsWeWantToHear", "debate questions we want to hear", tweet)
    tweet = re.sub(r"RoyalCarribean", "Royal Caribbean", tweet)
    tweet = re.sub(r"samanthaturne19", "Samantha Turner", tweet)
    tweet = re.sub(r"JonVoyage", "Jon Stewart", tweet)
    tweet = re.sub(r"renew911health", "renew 911 health", tweet)
    tweet = re.sub(r"SuryaRay", "Surya Ray", tweet)
    tweet = re.sub(r"pattonoswalt", "Patton Oswalt", tweet)
    tweet = re.sub(r"minhazmerchant", "Minhaz Merchant", tweet)
    tweet = re.sub(r"TLVFaces", "Israel Diaspora Coalition", tweet)
    tweet = re.sub(r"pmarca", "Marc Andreessen", tweet)
    tweet = re.sub(r"pdx911", "Portland Police", tweet)
    tweet = re.sub(r"jamaicaplain", "Jamaica Plain", tweet)
    tweet = re.sub(r"Japton", "Arkansas", tweet)
    tweet = re.sub(r"RouteComplex", "Route Complex", tweet)
    tweet = re.sub(r"INSubcontinent", "Indian Subcontinent", tweet)
    tweet = re.sub(r"NJTurnpike", "New Jersey Turnpike", tweet)
    tweet = re.sub(r"Politifiact", "PolitiFact", tweet)
    tweet = re.sub(r"Hiroshima70", "Hiroshima", tweet)
    tweet = re.sub(r"GMMBC", "Greater Mt Moriah Baptist Church", tweet)
    tweet = re.sub(r"versethe", "verse the", tweet)
    tweet = re.sub(r"TubeStrike", "Tube Strike", tweet)
    tweet = re.sub(r"MissionHills", "Mission Hills", tweet)
    tweet = re.sub(r"ProtectDenaliWolves", "Protect Denali Wolves", tweet)
    tweet = re.sub(r"NANKANA", "Nankana", tweet)
    tweet = re.sub(r"SAHIB", "Sahib", tweet)
    tweet = re.sub(r"PAKPATTAN", "Pakpattan", tweet)
    tweet = re.sub(r"Newz_Sacramento", "News Sacramento", tweet)
    tweet = re.sub(r"gofundme", "go fund me", tweet)
    tweet = re.sub(r"pmharper", "Stephen Harper", tweet)
    tweet = re.sub(r"IvanBerroa", "Ivan Berroa", tweet)
    tweet = re.sub(r"LosDelSonido", "Los Del Sonido", tweet)
    tweet = re.sub(r"bancodeseries", "banco de series", tweet)
    tweet = re.sub(r"timkaine", "Tim Kaine", tweet)
    tweet = re.sub(r"IdentityTheft", "Identity Theft", tweet)
    tweet = re.sub(r"AllLivesMatter", "All Lives Matter", tweet)
    tweet = re.sub(r"mishacollins", "Misha Collins", tweet)
    tweet = re.sub(r"BillNeelyNBC", "Bill Neely", tweet)
    tweet = re.sub(r"BeClearOnCancer", "be clear on cancer", tweet)
    tweet = re.sub(r"Kowing", "Knowing", tweet)
    tweet = re.sub(r"ScreamQueens", "Scream Queens", tweet)
    tweet = re.sub(r"AskCharley", "Ask Charley", tweet)
    tweet = re.sub(r"BlizzHeroes", "Heroes of the Storm", tweet)
    tweet = re.sub(r"BradleyBrad47", "Bradley Brad", tweet)
    tweet = re.sub(r"HannaPH", "Typhoon Hanna", tweet)
    tweet = re.sub(r"meinlcymbals", "MEINL Cymbals", tweet)
    tweet = re.sub(r"Ptbo", "Peterborough", tweet)
    tweet = re.sub(r"cnnbrk", "CNN Breaking News", tweet)
    tweet = re.sub(r"IndianNews", "Indian News", tweet)
    tweet = re.sub(r"savebees", "save bees", tweet)
    tweet = re.sub(r"GreenHarvard", "Green Harvard", tweet)
    tweet = re.sub(r"StandwithPP", "Stand with planned parenthood", tweet)
    tweet = re.sub(r"hermancranston", "Herman Cranston", tweet)
    tweet = re.sub(r"WMUR9", "WMUR-TV", tweet)
    tweet = re.sub(r"RockBottomRadFM", "Rock Bottom Radio", tweet)
    tweet = re.sub(r"ameenshaikh3", "Ameen Shaikh", tweet)
    tweet = re.sub(r"ProSyn", "Project Syndicate", tweet)
    tweet = re.sub(r"Daesh", "ISIS", tweet)
    tweet = re.sub(r"s2g", "swear to god", tweet)
    tweet = re.sub(r"listenlive", "listen live", tweet)
    tweet = re.sub(r"CDCgov", "Centers for Disease Control and Prevention", tweet)
    tweet = re.sub(r"FoxNew", "Fox News", tweet)
    tweet = re.sub(r"CBSBigBrother", "Big Brother", tweet)
    tweet = re.sub(r"JulieDiCaro", "Julie DiCaro", tweet)
    tweet = re.sub(r"theadvocatemag", "The Advocate Magazine", tweet)
    tweet = re.sub(r"RohnertParkDPS", "Rohnert Park Police Department", tweet)
    tweet = re.sub(r"THISIZBWRIGHT", "Bonnie Wright", tweet)
    tweet = re.sub(r"Popularmmos", "Popular MMOs", tweet)
    tweet = re.sub(r"WildHorses", "Wild Horses", tweet)
    tweet = re.sub(r"FantasticFour", "Fantastic Four", tweet)
    tweet = re.sub(r"HORNDALE", "Horndale", tweet)
    tweet = re.sub(r"PINER", "Piner", tweet)
    tweet = re.sub(r"BathAndNorthEastSomerset", "Bath and North East Somerset", tweet)
    tweet = re.sub(r"thatswhatfriendsarefor", "that is what friends are for", tweet)
    tweet = re.sub(r"residualincome", "residual income", tweet)
    tweet = re.sub(r"YahooNewsDigest", "Yahoo News Digest", tweet)
    tweet = re.sub(r"MalaysiaAirlines", "Malaysia Airlines", tweet)
    tweet = re.sub(r"AmazonDeals", "Amazon Deals", tweet)
    tweet = re.sub(r"MissCharleyWebb", "Charley Webb", tweet)
    tweet = re.sub(r"shoalstraffic", "shoals traffic", tweet)
    tweet = re.sub(r"GeorgeFoster72", "George Foster", tweet)
    tweet = re.sub(r"pop2015", "pop 2015", tweet)
    tweet = re.sub(r"_PokemonCards_", "Pokemon Cards", tweet)
    tweet = re.sub(r"DianneG", "Dianne Gallagher", tweet)
    tweet = re.sub(r"KashmirConflict", "Kashmir Conflict", tweet)
    tweet = re.sub(r"BritishBakeOff", "British Bake Off", tweet)
    tweet = re.sub(r"FreeKashmir", "Free Kashmir", tweet)
    tweet = re.sub(r"mattmosley", "Matt Mosley", tweet)
    tweet = re.sub(r"BishopFred", "Bishop Fred", tweet)
    tweet = re.sub(r"EndConflict", "End Conflict", tweet)
    tweet = re.sub(r"EndOccupation", "End Occupation", tweet)
    tweet = re.sub(r"UNHEALED", "unhealed", tweet)
    tweet = re.sub(r"CharlesDagnall", "Charles Dagnall", tweet)
    tweet = re.sub(r"Latestnews", "Latest news", tweet)
    tweet = re.sub(r"KindleCountdown", "Kindle Countdown", tweet)
    tweet = re.sub(r"NoMoreHandouts", "No More Handouts", tweet)
    tweet = re.sub(r"datingtips", "dating tips", tweet)
    tweet = re.sub(r"charlesadler", "Charles Adler", tweet)
    tweet = re.sub(r"twia", "Texas Windstorm Insurance Association", tweet)
    tweet = re.sub(r"txlege", "Texas Legislature", tweet)
    tweet = re.sub(r"WindstormInsurer", "Windstorm Insurer", tweet)
    tweet = re.sub(r"Newss", "News", tweet)
    tweet = re.sub(r"hempoil", "hemp oil", tweet)
    tweet = re.sub(r"CommoditiesAre", "Commodities are", tweet)
    tweet = re.sub(r"tubestrike", "tube strike", tweet)
    tweet = re.sub(r"JoeNBC", "Joe Scarborough", tweet)
    tweet = re.sub(r"LiteraryCakes", "Literary Cakes", tweet)
    tweet = re.sub(r"TI5", "The International 5", tweet)
    tweet = re.sub(r"thehill", "the hill", tweet)
    tweet = re.sub(r"3others", "3 others", tweet)
    tweet = re.sub(r"stighefootball", "Sam Tighe", tweet)
    tweet = re.sub(r"whatstheimportantvideo", "what is the important video", tweet)
    tweet = re.sub(r"ClaudioMeloni", "Claudio Meloni", tweet)
    tweet = re.sub(r"DukeSkywalker", "Duke Skywalker", tweet)
    tweet = re.sub(r"carsonmwr", "Fort Carson", tweet)
    tweet = re.sub(r"offdishduty", "off dish duty", tweet)
    tweet = re.sub(r"andword", "and word", tweet)
    tweet = re.sub(r"rhodeisland", "Rhode Island", tweet)
    tweet = re.sub(r"easternoregon", "Eastern Oregon", tweet)
    tweet = re.sub(r"WAwildfire", "Washington Wildfire", tweet)
    tweet = re.sub(r"fingerrockfire", "Finger Rock Fire", tweet)
    tweet = re.sub(r"57am", "57 am", tweet)
    tweet = re.sub(r"JacobHoggard", "Jacob Hoggard", tweet)
    tweet = re.sub(r"newnewnew", "new new new", tweet)
    tweet = re.sub(r"under50", "under 50", tweet)
    tweet = re.sub(r"getitbeforeitsgone", "get it before it is gone", tweet)
    tweet = re.sub(r"freshoutofthebox", "fresh out of the box", tweet)
    tweet = re.sub(r"amwriting", "am writing", tweet)
    tweet = re.sub(r"Bokoharm", "Boko Haram", tweet)
    tweet = re.sub(r"Nowlike", "Now like", tweet)
    tweet = re.sub(r"seasonfrom", "season from", tweet)
    tweet = re.sub(r"epicente", "epicenter", tweet)
    tweet = re.sub(r"epicenterr", "epicenter", tweet)
    tweet = re.sub(r"sicklife", "sick life", tweet)
    tweet = re.sub(r"yycweather", "Calgary Weather", tweet)
    tweet = re.sub(r"calgarysun", "Calgary Sun", tweet)
    tweet = re.sub(r"approachng", "approaching", tweet)
    tweet = re.sub(r"evng", "evening", tweet)
    tweet = re.sub(r"Sumthng", "something", tweet)
    tweet = re.sub(r"EllenPompeo", "Ellen Pompeo", tweet)
    tweet = re.sub(r"shondarhimes", "Shonda Rhimes", tweet)
    tweet = re.sub(r"ABCNetwork", "ABC Network", tweet)
    tweet = re.sub(r"SushmaSwaraj", "Sushma Swaraj", tweet)
    tweet = re.sub(r"pray4japan", "Pray for Japan", tweet)
    tweet = re.sub(r"hope4japan", "Hope for Japan", tweet)
    tweet = re.sub(r"Illusionimagess", "Illusion images", tweet)
    tweet = re.sub(r"SummerUnderTheStars", "Summer Under The Stars", tweet)
    tweet = re.sub(r"ShallWeDance", "Shall We Dance", tweet)
    tweet = re.sub(r"TCMParty", "TCM Party", tweet)
    tweet = re.sub(r"marijuananews", "marijuana news", tweet)
    tweet = re.sub(r"onbeingwithKristaTippett", "on being with Krista Tippett", tweet)
    tweet = re.sub(r"Beingtweets", "Being tweets", tweet)
    tweet = re.sub(r"newauthors", "new authors", tweet)
    tweet = re.sub(r"remedyyyy", "remedy", tweet)
    tweet = re.sub(r"44PM", "44 PM", tweet)
    tweet = re.sub(r"HeadlinesApp", "Headlines App", tweet)
    tweet = re.sub(r"40PM", "40 PM", tweet)
    tweet = re.sub(r"myswc", "Severe Weather Center", tweet)
    tweet = re.sub(r"ithats", "that is", tweet)
    tweet = re.sub(r"icouldsitinthismomentforever", "I could sit in this moment forever", tweet)
    tweet = re.sub(r"FatLoss", "Fat Loss", tweet)
    tweet = re.sub(r"02PM", "02 PM", tweet)
    tweet = re.sub(r"MetroFmTalk", "Metro Fm Talk", tweet)
    tweet = re.sub(r"Bstrd", "bastard", tweet)
    tweet = re.sub(r"bldy", "bloody", tweet)
    tweet = re.sub(r"MetrofmTalk", "Metro Fm Talk", tweet)
    tweet = re.sub(r"terrorismturn", "terrorism turn", tweet)
    tweet = re.sub(r"BBCNewsAsia", "BBC News Asia", tweet)
    tweet = re.sub(r"BehindTheScenes", "Behind The Scenes", tweet)
    tweet = re.sub(r"GeorgeTakei", "George Takei", tweet)
    tweet = re.sub(r"WomensWeeklyMag", "Womens Weekly Magazine", tweet)
    tweet = re.sub(r"SurvivorsGuidetoEarth", "Survivors Guide to Earth", tweet)
    tweet = re.sub(r"incubusband", "incubus band", tweet)
    tweet = re.sub(r"Babypicturethis", "Baby picture this", tweet)
    tweet = re.sub(r"BombEffects", "Bomb Effects", tweet)
    tweet = re.sub(r"win10", "Windows 10", tweet)
    tweet = re.sub(r"idkidk", "I do not know I do not know", tweet)
    tweet = re.sub(r"TheWalkingDead", "The Walking Dead", tweet)
    tweet = re.sub(r"amyschumer", "Amy Schumer", tweet)
    tweet = re.sub(r"crewlist", "crew list", tweet)
    tweet = re.sub(r"Erdogans", "Erdogan", tweet)
    tweet = re.sub(r"BBCLive", "BBC Live", tweet)
    tweet = re.sub(r"TonyAbbottMHR", "Tony Abbott", tweet)
    tweet = re.sub(r"paulmyerscough", "Paul Myerscough", tweet)
    tweet = re.sub(r"georgegallagher", "George Gallagher", tweet)
    tweet = re.sub(r"JimmieJohnson", "Jimmie Johnson", tweet)
    tweet = re.sub(r"pctool", "pc tool", tweet)
    tweet = re.sub(r"DoingHashtagsRight", "Doing Hashtags Right", tweet)
    tweet = re.sub(r"ThrowbackThursday", "Throwback Thursday", tweet)
    tweet = re.sub(r"SnowBackSunday", "Snowback Sunday", tweet)
    tweet = re.sub(r"LakeEffect", "Lake Effect", tweet)
    tweet = re.sub(r"RTphotographyUK", "Richard Thomas Photography UK", tweet)
    tweet = re.sub(r"BigBang_CBS", "Big Bang CBS", tweet)
    tweet = re.sub(r"writerslife", "writers life", tweet)
    tweet = re.sub(r"NaturalBirth", "Natural Birth", tweet)
    tweet = re.sub(r"UnusualWords", "Unusual Words", tweet)
    tweet = re.sub(r"wizkhalifa", "Wiz Khalifa", tweet)
    tweet = re.sub(r"acreativedc", "a creative DC", tweet)
    tweet = re.sub(r"vscodc", "vsco DC", tweet)
    tweet = re.sub(r"VSCOcam", "vsco camera", tweet)
    tweet = re.sub(r"TheBEACHDC", "The beach DC", tweet)
    tweet = re.sub(r"buildingmuseum", "building museum", tweet)
    tweet = re.sub(r"WorldOil", "World Oil", tweet)
    tweet = re.sub(r"redwedding", "red wedding", tweet)
    tweet = re.sub(r"AmazingRaceCanada", "Amazing Race Canada", tweet)
    tweet = re.sub(r"WakeUpAmerica", "Wake Up America", tweet)
    tweet = re.sub(r"\\Allahuakbar\\", "Allahu Akbar", tweet)
    tweet = re.sub(r"bleased", "blessed", tweet)
    tweet = re.sub(r"nigeriantribune", "Nigerian Tribune", tweet)
    tweet = re.sub(r"HIDEO_KOJIMA_EN", "Hideo Kojima", tweet)
    tweet = re.sub(r"FusionFestival", "Fusion Festival", tweet)
    tweet = re.sub(r"50Mixed", "50 Mixed", tweet)
    tweet = re.sub(r"NoAgenda", "No Agenda", tweet)
    tweet = re.sub(r"WhiteGenocide", "White Genocide", tweet)
    tweet = re.sub(r"dirtylying", "dirty lying", tweet)
    tweet = re.sub(r"SyrianRefugees", "Syrian Refugees", tweet)
    tweet = re.sub(r"changetheworld", "change the world", tweet)
    tweet = re.sub(r"Ebolacase", "Ebola case", tweet)
    tweet = re.sub(r"mcgtech", "mcg technologies", tweet)
    tweet = re.sub(r"withweapons", "with weapons", tweet)
    tweet = re.sub(r"advancedwarfare", "advanced warfare", tweet)
    tweet = re.sub(r"letsFootball", "let us Football", tweet)
    tweet = re.sub(r"LateNiteMix", "late night mix", tweet)
    tweet = re.sub(r"PhilCollinsFeed", "Phil Collins", tweet)
    tweet = re.sub(r"RudyHavenstein", "Rudy Havenstein", tweet)
    tweet = re.sub(r"22PM", "22 PM", tweet)
    tweet = re.sub(r"54am", "54 AM", tweet)
    tweet = re.sub(r"38am", "38 AM", tweet)
    tweet = re.sub(r"OldFolkExplainStuff", "Old Folk Explain Stuff", tweet)
    tweet = re.sub(r"BlacklivesMatter", "Black Lives Matter", tweet)
    tweet = re.sub(r"InsaneLimits", "Insane Limits", tweet)
    tweet = re.sub(r"youcantsitwithus", "you cannot sit with us", tweet)
    tweet = re.sub(r"2k15", "2015", tweet)
    tweet = re.sub(r"TheIran", "Iran", tweet)
    tweet = re.sub(r"JimmyFallon", "Jimmy Fallon", tweet)
    tweet = re.sub(r"AlbertBrooks", "Albert Brooks", tweet)
    tweet = re.sub(r"defense_news", "defense news", tweet)
    tweet = re.sub(r"nuclearrcSA", "Nuclear Risk Control Self Assessment", tweet)
    tweet = re.sub(r"Auspol", "Australia Politics", tweet)
    tweet = re.sub(r"NuclearPower", "Nuclear Power", tweet)
    tweet = re.sub(r"WhiteTerrorism", "White Terrorism", tweet)
    tweet = re.sub(r"truthfrequencyradio", "Truth Frequency Radio", tweet)
    tweet = re.sub(r"ErasureIsNotEquality", "Erasure is not equality", tweet)
    tweet = re.sub(r"ProBonoNews", "Pro Bono News", tweet)
    tweet = re.sub(r"JakartaPost", "Jakarta Post", tweet)
    tweet = re.sub(r"toopainful", "too painful", tweet)
    tweet = re.sub(r"melindahaunton", "Melinda Haunton", tweet)
    tweet = re.sub(r"NoNukes", "No Nukes", tweet)
    tweet = re.sub(r"curryspcworld", "Currys PC World", tweet)
    tweet = re.sub(r"ineedcake", "I need cake", tweet)
    tweet = re.sub(r"blackforestgateau", "black forest gateau", tweet)
    tweet = re.sub(r"BBCOne", "BBC One", tweet)
    tweet = re.sub(r"AlexxPage", "Alex Page", tweet)
    tweet = re.sub(r"jonathanserrie", "Jonathan Serrie", tweet)
    tweet = re.sub(r"SocialJerkBlog", "Social Jerk Blog", tweet)
    tweet = re.sub(r"ChelseaVPeretti", "Chelsea Peretti", tweet)
    tweet = re.sub(r"irongiant", "iron giant", tweet)
    tweet = re.sub(r"RonFunches", "Ron Funches", tweet)
    tweet = re.sub(r"TimCook", "Tim Cook", tweet)
    tweet = re.sub(r"sebastianstanisaliveandwell", "Sebastian Stan is alive and well", tweet)
    tweet = re.sub(r"Madsummer", "Mad summer", tweet)
    tweet = re.sub(r"NowYouKnow", "Now you know", tweet)
    tweet = re.sub(r"concertphotography", "concert photography", tweet)
    tweet = re.sub(r"TomLandry", "Tom Landry", tweet)
    tweet = re.sub(r"showgirldayoff", "show girl day off", tweet)
    tweet = re.sub(r"Yougslavia", "Yugoslavia", tweet)
    tweet = re.sub(r"QuantumDataInformatics", "Quantum Data Informatics", tweet)
    tweet = re.sub(r"FromTheDesk", "From The Desk", tweet)
    tweet = re.sub(r"TheaterTrial", "Theater Trial", tweet)
    tweet = re.sub(r"CatoInstitute", "Cato Institute", tweet)
    tweet = re.sub(r"EmekaGift", "Emeka Gift", tweet)
    tweet = re.sub(r"LetsBe_Rational", "Let us be rational", tweet)
    tweet = re.sub(r"Cynicalreality", "Cynical reality", tweet)
    tweet = re.sub(r"FredOlsenCruise", "Fred Olsen Cruise", tweet)
    tweet = re.sub(r"NotSorry", "not sorry", tweet)
    tweet = re.sub(r"UseYourWords", "use your words", tweet)
    tweet = re.sub(r"WordoftheDay", "word of the day", tweet)
    tweet = re.sub(r"Dictionarycom", "Dictionary.com", tweet)
    tweet = re.sub(r"TheBrooklynLife", "The Brooklyn Life", tweet)
    tweet = re.sub(r"jokethey", "joke they", tweet)
    tweet = re.sub(r"nflweek1picks", "NFL week 1 picks", tweet)
    tweet = re.sub(r"uiseful", "useful", tweet)
    tweet = re.sub(r"JusticeDotOrg", "The American Association for Justice", tweet)
    tweet = re.sub(r"autoaccidents", "auto accidents", tweet)
    tweet = re.sub(r"SteveGursten", "Steve Gursten", tweet)
    tweet = re.sub(r"MichiganAutoLaw", "Michigan Auto Law", tweet)
    tweet = re.sub(r"birdgang", "bird gang", tweet)
    tweet = re.sub(r"nflnetwork", "NFL Network", tweet)
    tweet = re.sub(r"NYDNSports", "NY Daily News Sports", tweet)
    tweet = re.sub(r"RVacchianoNYDN", "Ralph Vacchiano NY Daily News", tweet)
    tweet = re.sub(r"EdmontonEsks", "Edmonton Eskimos", tweet)
    tweet = re.sub(r"david_brelsford", "David Brelsford", tweet)
    tweet = re.sub(r"TOI_India", "The Times of India", tweet)
    tweet = re.sub(r"hegot", "he got", tweet)
    tweet = re.sub(r"SkinsOn9", "Skins on 9", tweet)
    tweet = re.sub(r"sothathappened", "so that happened", tweet)
    tweet = re.sub(r"LCOutOfDoors", "LC Out Of Doors", tweet)
    tweet = re.sub(r"NationFirst", "Nation First", tweet)
    tweet = re.sub(r"IndiaToday", "India Today", tweet)
    tweet = re.sub(r"HLPS", "helps", tweet)
    tweet = re.sub(r"HOSTAGESTHROSW", "hostages throw", tweet)
    tweet = re.sub(r"SNCTIONS", "sanctions", tweet)
    tweet = re.sub(r"BidTime", "Bid Time", tweet)
    tweet = re.sub(r"crunchysensible", "crunchy sensible", tweet)
    tweet = re.sub(r"RandomActsOfRomance", "Random acts of romance", tweet)
    tweet = re.sub(r"MomentsAtHill", "Moments at hill", tweet)
    tweet = re.sub(r"eatshit", "eat shit", tweet)
    tweet = re.sub(r"liveleakfun", "live leak fun", tweet)
    tweet = re.sub(r"SahelNews", "Sahel News", tweet)
    tweet = re.sub(r"abc7newsbayarea", "ABC 7 News Bay Area", tweet)
    tweet = re.sub(r"facilitiesmanagement", "facilities management", tweet)
    tweet = re.sub(r"facilitydude", "facility dude", tweet)
    tweet = re.sub(r"CampLogistics", "Camp logistics", tweet)
    tweet = re.sub(r"alaskapublic", "Alaska public", tweet)
    tweet = re.sub(r"MarketResearch", "Market Research", tweet)
    tweet = re.sub(r"AccuracyEsports", "Accuracy Esports", tweet)
    tweet = re.sub(r"TheBodyShopAust", "The Body Shop Australia", tweet)
    tweet = re.sub(r"yychail", "Calgary hail", tweet)
    tweet = re.sub(r"yyctraffic", "Calgary traffic", tweet)
    tweet = re.sub(r"eliotschool", "eliot school", tweet)
    tweet = re.sub(r"TheBrokenCity", "The Broken City", tweet)
    tweet = re.sub(r"OldsFireDept", "Olds Fire Department", tweet)
    tweet = re.sub(r"RiverComplex", "River Complex", tweet)
    tweet = re.sub(r"fieldworksmells", "field work smells", tweet)
    tweet = re.sub(r"IranElection", "Iran Election", tweet)
    tweet = re.sub(r"glowng", "glowing", tweet)
    tweet = re.sub(r"kindlng", "kindling", tweet)
    tweet = re.sub(r"riggd", "rigged", tweet)
    tweet = re.sub(r"slownewsday", "slow news day", tweet)
    tweet = re.sub(r"MyanmarFlood", "Myanmar Flood", tweet)
    tweet = re.sub(r"abc7chicago", "ABC 7 Chicago", tweet)
    tweet = re.sub(r"copolitics", "Colorado Politics", tweet)
    tweet = re.sub(r"AdilGhumro", "Adil Ghumro", tweet)
    tweet = re.sub(r"netbots", "net bots", tweet)
    tweet = re.sub(r"byebyeroad", "bye bye road", tweet)
    tweet = re.sub(r"massiveflooding", "massive flooding", tweet)
    tweet = re.sub(r"EndofUS", "End of United States", tweet)
    tweet = re.sub(r"35PM", "35 PM", tweet)
    tweet = re.sub(r"greektheatrela", "Greek Theatre Los Angeles", tweet)
    tweet = re.sub(r"76mins", "76 minutes", tweet)
    tweet = re.sub(r"publicsafetyfirst", "public safety first", tweet)
    tweet = re.sub(r"livesmatter", "lives matter", tweet)
    tweet = re.sub(r"myhometown", "my hometown", tweet)
    tweet = re.sub(r"tankerfire", "tanker fire", tweet)
    tweet = re.sub(r"MEMORIALDAY", "memorial day", tweet)
    tweet = re.sub(r"MEMORIAL_DAY", "memorial day", tweet)
    tweet = re.sub(r"instaxbooty", "instagram booty", tweet)
    tweet = re.sub(r"Jerusalem_Post", "Jerusalem Post", tweet)
    tweet = re.sub(r"WayneRooney_INA", "Wayne Rooney", tweet)
    tweet = re.sub(r"VirtualReality", "Virtual Reality", tweet)
    tweet = re.sub(r"OculusRift", "Oculus Rift", tweet)
    tweet = re.sub(r"OwenJones84", "Owen Jones", tweet)
    tweet = re.sub(r"jeremycorbyn", "Jeremy Corbyn", tweet)
    tweet = re.sub(r"paulrogers002", "Paul Rogers", tweet)
    tweet = re.sub(r"mortalkombatx", "Mortal Kombat X", tweet)
    tweet = re.sub(r"mortalkombat", "Mortal Kombat", tweet)
    tweet = re.sub(r"FilipeCoelho92", "Filipe Coelho", tweet)
    tweet = re.sub(r"OnlyQuakeNews", "Only Quake News", tweet)
    tweet = re.sub(r"kostumes", "costumes", tweet)
    tweet = re.sub(r"YEEESSSS", "yes", tweet)
    tweet = re.sub(r"ToshikazuKatayama", "Toshikazu Katayama", tweet)
    tweet = re.sub(r"IntlDevelopment", "Intl Development", tweet)
    tweet = re.sub(r"ExtremeWeather", "Extreme Weather", tweet)
    tweet = re.sub(r"WereNotGruberVoters", "We are not gruber voters", tweet)
    tweet = re.sub(r"NewsThousands", "News Thousands", tweet)
    tweet = re.sub(r"EdmundAdamus", "Edmund Adamus", tweet)
    tweet = re.sub(r"EyewitnessWV", "Eye witness WV", tweet)
    tweet = re.sub(r"PhiladelphiaMuseu", "Philadelphia Museum", tweet)
    tweet = re.sub(r"DublinComicCon", "Dublin Comic Con", tweet)
    tweet = re.sub(r"NicholasBrendon", "Nicholas Brendon", tweet)
    tweet = re.sub(r"Alltheway80s", "All the way 80s", tweet)
    tweet = re.sub(r"FromTheField", "From the field", tweet)
    tweet = re.sub(r"NorthIowa", "North Iowa", tweet)
    tweet = re.sub(r"WillowFire", "Willow Fire", tweet)
    tweet = re.sub(r"MadRiver Complex", "Mad River Complex", tweet)  # "RiverComplex" has already been split by an earlier rule
    tweet = re.sub(r"feelingmanly", "feeling manly", tweet)
    tweet = re.sub(r"stillnotoverit", "still not over it", tweet)
    tweet = re.sub(r"FortitudeValley", "Fortitude Valley", tweet)
    tweet = re.sub(r"CoastpowerlineTramTr", "Coast powerline", tweet)
    tweet = re.sub(r"ServicesGold", "Services Gold", tweet)
    tweet = re.sub(r"NewsbrokenEmergency", "News broken emergency", tweet)
    tweet = re.sub(r"Evaucation", "evacuation", tweet)
    tweet = re.sub(r"leaveevacuateexitbe", "leave evacuate exit be", tweet)
    tweet = re.sub(r"P_EOPLE", "PEOPLE", tweet)
    tweet = re.sub(r"Tubestrike", "tube strike", tweet)
    tweet = re.sub(r"CLASS_SICK", "CLASS SICK", tweet)
    tweet = re.sub(r"localplumber", "local plumber", tweet)
    tweet = re.sub(r"awesomejobsiri", "awesome job siri", tweet)
    tweet = re.sub(r"PayForItHow", "Pay for it how", tweet)
    tweet = re.sub(r"ThisIsAfrica", "This is Africa", tweet)
    tweet = re.sub(r"crimeairnetwork", "crime air network", tweet)
    tweet = re.sub(r"KimAcheson", "Kim Acheson", tweet)
    tweet = re.sub(r"cityofcalgary", "City of Calgary", tweet)
    tweet = re.sub(r"prosyndicate", "pro syndicate", tweet)
    tweet = re.sub(r"660NEWS", "660 NEWS", tweet)
    tweet = re.sub(r"BusInsMagazine", "Business Insurance Magazine", tweet)
    tweet = re.sub(r"wfocus", "focus", tweet)
    tweet = re.sub(r"ShastaDam", "Shasta Dam", tweet)
    tweet = re.sub(r"go2MarkFranco", "Mark Franco", tweet)
    tweet = re.sub(r"StephGHinojosa", "Steph Hinojosa", tweet)
    tweet = re.sub(r"Nashgrier", "Nash Grier", tweet)
    tweet = re.sub(r"NashNewVideo", "Nash new video", tweet)
    tweet = re.sub(r"IWouldntGetElectedBecause", "I would not get elected because", tweet)
    tweet = re.sub(r"SHGames", "Sledgehammer Games", tweet)
    tweet = re.sub(r"bedhair", "bed hair", tweet)
    tweet = re.sub(r"JoelHeyman", "Joel Heyman", tweet)
    tweet = re.sub(r"viaYouTube", "via YouTube", tweet)
    return tweet
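
A long chain of `re.sub` calls like the one above can also be expressed as data. The sketch below is only an illustration, not the method used in this notebook: `REPLACEMENTS` and `expand_tokens` are hypothetical names, and the dictionary holds just a handful of the rules listed above. All patterns are treated as literal strings and applied in a single pass.

```python
import re

# Hypothetical subset of the substitutions above, stored as data instead of code.
REPLACEMENTS = {
    "yycweather": "Calgary Weather",
    "pray4japan": "Pray for Japan",
    "mortalkombatx": "Mortal Kombat X",
    "mortalkombat": "Mortal Kombat",
}

# Longer keys go first so that "mortalkombatx" wins over its prefix "mortalkombat".
_pattern = re.compile(
    "|".join(re.escape(key) for key in sorted(REPLACEMENTS, key=len, reverse=True))
)

def expand_tokens(tweet):
    """Apply every replacement in one pass over the text."""
    return _pattern.sub(lambda match: REPLACEMENTS[match.group(0)], tweet)

example = expand_tokens("#yycweather alert, playing mortalkombatx")
```

Note that a single pass behaves differently from sequential calls when the output of one rule is expected to feed a later rule, so the two approaches are not always interchangeable.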

The cleaning function below could be written as a one-liner, but I decomposed it for easier debugging and maintenance.

In [None]:
# function definitions are in hidden cells above (long)
def cleaning_1(data):
    data['text_no_contr'] = data['text'].apply(decontracted) # removing contractions
    data['text_clean'] = data['text_no_contr'].apply(special_chars) # correcting special characters
    data['text'] = data['text_clean'].apply(various) # applying remaining cleaning functions

cleaning_1(train_data)
cleaning_1(test_data)

Below there are 7 new features that will be created:
1. words count (all)
2. numbers count
3. hashtags count
4. mentions count
5. URLs count
6. mean length of words
7. characters count

To create them we will need tokenizers from the NLTK library. Analysing Twitter content is such a popular task that NLTK contains a dedicated tokenizer for tweets.
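
To see why a dedicated tokenizer helps, here is a minimal sketch comparing `TweetTokenizer` with a plain whitespace split. The sample tweet is made up for illustration.

```python
from nltk.tokenize import TweetTokenizer

sample = "Flood warning for #Calgary, says @cityofcalgary!"  # made-up tweet

# TweetTokenizer keeps hashtags and mentions as single tokens
# and splits the surrounding punctuation off.
tweet_tokens = TweetTokenizer().tokenize(sample)

# A whitespace split leaves punctuation glued to the hashtag and the mention.
whitespace_tokens = sample.split()

print(tweet_tokens)
print(whitespace_tokens)
```

Keeping `#Calgary` and `@cityofcalgary` intact is what makes the hashtag and mention counts reliable.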

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize, regexp_tokenize # functions for standard tokenisation
from nltk.tokenize import TweetTokenizer # function for tweets tokenization

In [None]:
train_data.head(20)

In [None]:
%%time

tknzr = TweetTokenizer() # initialization of Tweet Tokenizer

def mean_words_length(text):
    words = word_tokenize(text)
    word_lengths = [len(w) for w in words]
    return round(np.mean(word_lengths),1)

def features_1(data):
    # words count
    data['words_count'] = data['text'].apply(lambda x: len(tknzr.tokenize(x)))
    # numbers count
    numbers_regex = r"(\d+\.?,?\s?\d+)"
    data['numbers_count'] = data['text'].apply(lambda x: len(regexp_tokenize(x, numbers_regex)))
    # hashtags count
    hashtags_regex = r"#\w+"
    data['hashtags_count'] = data['text'].apply(lambda x: len(regexp_tokenize(x, hashtags_regex)))
    # mentions count
    mentions_regex = r"@\w+"
    data['mentions_count'] = data['text'].apply(lambda x: len(regexp_tokenize(x, mentions_regex)))
    # url count
    data['url_count'] = data['text'].apply(lambda x: len([w for w in str(x).lower().split() if 'http' in w or 'https' in w]))
    # mean words length
    data['mean_words_length'] = data['text'].apply(mean_words_length)
    # characters count
    data['characters_count'] = data['text'].apply(lambda x: len(x))

features_1(train_data)
features_1(test_data)
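
One detail of `numbers_regex` is worth checking: the pattern needs at least two digits to match, so single-digit numbers are not counted. The sketch below shows this on a made-up sentence using plain `re.findall`. The alternative pattern is only a suggestion, not what this notebook uses.

```python
import re

text = "7 dead, 15 injured, 1,200 evacuated"  # made-up example

# Pattern used in features_1: a digit run, optional separators, another digit run.
numbers_regex = r"(\d+\.?,?\s?\d+)"
found_original = re.findall(numbers_regex, text)

# Hypothetical alternative: one digit run, optionally followed by ".ddd" or ",ddd" groups.
alt_regex = r"\d+(?:[.,]\d+)*"
found_alternative = re.findall(alt_regex, text)

print(found_original)     # the lone "7" is missing
print(found_alternative)  # includes "7"
```

Whether single digits should count is a modelling choice. If they matter (for example casualty figures), a pattern along the lines of `alt_regex` may be worth trying.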

In [None]:
train_data.head()

In [None]:
test_data.head()

In [None]:
fig, (ax1,ax2,ax3) = plt.subplots(1,3, figsize=(16,5))
sns.distplot(train_data[train_data['target']==1]['words_count'], label='True', ax=ax1)
sns.distplot(train_data[train_data['target']==0]['words_count'], label='Fake', ax=ax1)
ax1.legend()

sns.distplot(train_data[train_data['target']==1]['mean_words_length'], label='True', ax=ax2)
sns.distplot(train_data[train_data['target']==0]['mean_words_length'], label='Fake', ax=ax2)
ax2.legend()

sns.distplot(train_data[train_data['target']==1]['characters_count'], label='True', ax=ax3)
sns.distplot(train_data[train_data['target']==0]['characters_count'], label='Fake', ax=ax3)
ax3.legend()

plt.show()

The histograms above show that the word counts of both types of tweets are similar, but fake ones tend to contain longer words.

In [None]:
fig, (ax1,ax2,ax3,ax4) = plt.subplots(1,4, figsize=(16,5))

sns.countplot(x='numbers_count', hue='target', data=train_data, ax=ax1)
ax1.legend(labels=['Fake','True'], loc=1)

sns.countplot(x='hashtags_count', hue='target', data=train_data, ax=ax2)
ax2.legend(labels=['Fake','True'], loc=1)
ax2.yaxis.label.set_visible(False)

sns.countplot(x='mentions_count', hue='target', data=train_data, ax=ax3)
ax3.legend(labels=['Fake','True'], loc=1)
ax3.yaxis.label.set_visible(False)

sns.countplot(x='url_count', hue='target', data=train_data, ax=ax4)
ax4.legend(labels=['Fake','True'], loc=1)
ax4.yaxis.label.set_visible(False)

plt.show()

The bar charts above show that fake tweets contain more numbers, hashtags and mentions. It may be a way to speed up the spread of fake news (my theory).

Below there are additional columns that will be created:
1. `lowercase_bag_o_w` - lowercase bag of words
2. `stopwords`
3. `stopwords_count`
4. `alpha_only` - bag of words containing alphabetical tokens only
5. `alpha_count` - count of alphabetical tokens
6. `punctuation_count`

Some of these columns may be dropped later if necessary.

In [None]:
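from nltk.corpus import stopwords
import string

english_stopwords = set(stopwords.words('english')) # built once; set membership tests are fast

def features_2(data):
    # lowercase tokens
    data['lowercase_bag_o_w'] = data['text'].apply(lambda x: [w for w in tknzr.tokenize(x.lower())])
    # stopwords (the text is tokenised first; iterating over the raw string would yield single characters)
    data['stopwords'] = data['text_no_contr'].apply(lambda x: [t for t in tknzr.tokenize(x.lower()) if t in english_stopwords])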
    # stopwords count
    data['stopwords_count'] = data['stopwords'].apply(lambda x: len(x))
    # alpha words only (excludes mentions and hashtags)
    data['alpha_only'] = data['lowercase_bag_o_w'].apply(lambda x: [t for t in x if t.isalpha()])
    # counts of alpha words only
    data['alpha_count'] = data['alpha_only'].apply(lambda x: len(x))
    # counts of punctuation marks only
    punctuation_regex = r"[^\w\s]"
    data['punctuation_count'] = data['text'].apply(lambda x: len(regexp_tokenize(x, punctuation_regex)))

features_2(train_data)
features_2(test_data)

In [None]:
train_data.head()

In [None]:
fig, (ax1,ax2,ax3,ax4) = plt.subplots(1,4, figsize=(16,6))
sns.distplot(train_data[train_data['target']==1]['stopwords_count'], label='True', ax=ax1)
sns.distplot(train_data[train_data['target']==0]['stopwords_count'], label='Fake', ax=ax1)
ax1.legend()

sns.distplot(train_data[train_data['target']==1]['alpha_count'], label='True', ax=ax2)
sns.distplot(train_data[train_data['target']==0]['alpha_count'], label='Fake', ax=ax2)
ax2.legend()

sns.distplot(train_data[train_data['target']==1]['punctuation_count'], label='True', ax=ax3)
sns.distplot(train_data[train_data['target']==0]['punctuation_count'], label='Fake', ax=ax3)
ax3.legend()

sns.distplot(train_data[train_data['target']==1]['url_count'], label='True', ax=ax4)
sns.distplot(train_data[train_data['target']==0]['url_count'], label='Fake', ax=ax4)
ax4.legend()
plt.show()

The histograms above show a difference in the number of punctuation marks between fake and true tweets: fake ones tend to have fewer of them.

Below we will generate a wordcloud from raw text excluding stopwords.

In [None]:
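from PIL import Image
from wordcloud import WordCloud, STOPWORDS

# explode() yields NaN for tweets without any alphabetical token, so drop those before joining
alpha_only_cloud = " ".join(train_data['alpha_only'].explode().dropna())

# defining cloud of words
cloud_words_raw = " ".join(review for review in train_data.text)
stopwords = set(STOPWORDS) # note: this name shadows the nltk `stopwords` module imported earlier
stopwords.update(["http", "https","co","com","amp"])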

# creating cloud of words
fig, ax1 = plt.subplots(figsize=(8,6))
wordcloud = WordCloud(stopwords=stopwords, background_color="white",height=300, contour_width=3).generate(cloud_words_raw)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

Below there is a cloud of words from alphabetical entities only!

In [None]:
# creating cloud of words
fig, ax1 = plt.subplots(figsize=(10,6))
wordcloud = WordCloud(stopwords=stopwords, background_color="white",height=300, contour_color='firebrick', contour_width=3).generate(alpha_only_cloud)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()

**BIGRAMS**

In [None]:
from nltk.util import bigrams

def extract_bigrams(data):
    # this function extracts bigrams from the alphabetical tokens of each tweet, without stopwords
    data['alpha_only_clean'] = data['alpha_only'].apply(lambda x: [item for item in x if item not in stopwords])
    data['bigrams'] = data['alpha_only_clean'].apply(lambda x: list(bigrams(x)))
    bigrams_list = data['bigrams'].tolist()
    bigrams_list = [a for b in bigrams_list for a in b] # flatten the per-tweet lists into one list
    return bigrams_list

train_bigrams = extract_bigrams(train_data)
test_bigrams = extract_bigrams(test_data)

In [None]:
import collections

def count_bigrams(bigrams_list):
    counter_bigrams = collections.Counter(bigrams_list)
    top30_bigrams = counter_bigrams.most_common(30)
    labels = [str(r[0]) for r in top30_bigrams]
    values = [r[1] for r in top30_bigrams]
    return labels,values

train_bgr_labels, train_bgr_values = count_bigrams(train_bigrams)
test_bgr_labels, test_bgr_values = count_bigrams(test_bigrams)

fig, ax = plt.subplots(1,2, figsize=(16,6))
sns.barplot(x=train_bgr_labels, y=train_bgr_values, ax=ax[0])
sns.barplot(x=test_bgr_labels, y=test_bgr_values, ax=ax[1])
ax[0].tick_params(labelrotation=90)
ax[1].tick_params(labelrotation=90)
ax[0].set_title('Training dataset', fontsize=13)
ax[1].set_title('Testing dataset', fontsize=13)
plt.show()

In [None]:
y_train = train_data.pop('target')

cols_to_drop = ['id','location','bigrams','text','text_no_contr','lowercase_bag_o_w','text_clean','stopwords','alpha_only','alpha_only_clean']
X_train = train_data.drop(cols_to_drop, axis=1)
X_train.head()

In [None]:
X_train = pd.get_dummies(X_train, prefix=['key'], columns=['keyword'])
X_train

In [None]:
X_test = test_data.drop(cols_to_drop, axis=1)
X_test = pd.get_dummies(X_test, prefix=['key'], columns=['keyword'])

In [None]:
print('After pre-processing there are {} rows and {} columns in a training dataset.'.format(X_train.shape[0],X_train.shape[1]))
print('After pre-processing there are {} rows and {} columns in a testing dataset.'.format(X_test.shape[0],X_test.shape[1]))
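
Because `pd.get_dummies` is applied to the training and testing sets separately, the two frames only end up with the same columns if both contain exactly the same keywords. The sketch below uses toy data (not the competition files) to show one way of aligning them with `DataFrame.reindex`. The names `train_toy`, `test_toy` and `test_aligned` are made up for illustration.

```python
import pandas as pd

# Toy frames standing in for X_train / X_test before encoding.
train_toy = pd.DataFrame({"words_count": [10, 12, 8],
                          "keyword": ["fire", "flood", "fire"]})
test_toy = pd.DataFrame({"words_count": [9, 11],
                         "keyword": ["fire", "quake"]})

train_enc = pd.get_dummies(train_toy, prefix=["key"], columns=["keyword"])
test_enc = pd.get_dummies(test_toy, prefix=["key"], columns=["keyword"])

# Keep exactly the training columns: dummies missing from the test set are
# filled with 0, and dummies unseen during training are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

If the two shapes printed above already agree on the number of columns, this step changes nothing, but it guards against a mismatch when the data changes.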

<a id='XGB'></a>
## 4. Baseline models<a href='#Top'>^</a><br>  

In this section we will create and optimise a few baseline models, both boosted (XGB, AdaBoost) and bagged (Random Forest).

First, a basic XGBoost classification model will be created. XGBoost is a modern, state-of-the-art boosting algorithm that is very popular in many tasks and has a convenient Scikit-Learn API. Check this [article](https://towardsdatascience.com/getting-started-with-xgboost-in-scikit-learn-f69f5f470a97) on Towards Data Science if you want to learn more details about it.

### 4.1 XGBoost

"XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI) and can solve problems beyond billions of examples." - XGBoost [documentation](https://xgboost.readthedocs.io/en/latest/).

In [None]:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

xgb = XGBClassifier(learning_rate=0.02, n_estimators=600, objective='binary:logistic',
                    verbosity=0, n_jobs=1)  # verbosity/n_jobs replace the deprecated silent/nthread

Parameters to be investigated (arbitrarily chosen).

In [None]:
# A parameter grid for XGBoost
params_1 = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.8, 1.0],
        'max_depth': [5, 6, 7]
        }

Randomized search (alternatively grid search). The number of iterations has been reduced to 4 to fit within the maximum notebook runtime on Kaggle (later stages require much more time).

In [None]:
%%time

folds = 3  # number of folds to be used

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1)  # define a stratified K-Fold to preserve percentage of each target class

random_search_1 = RandomizedSearchCV(xgb, param_distributions=params_1, n_iter=4, scoring=['roc_auc','accuracy','recall','precision'],
                                   n_jobs=-1, cv=skf.split(X_train,y_train), verbose=2, random_state=1001, refit='roc_auc')

random_search_1.fit(X_train, y_train)
random_search_1.best_params_
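
To put the choice of randomized search into numbers, the sketch below counts how many model fits an exhaustive grid search over the same parameter grid would need, compared with 4 random draws. It only does arithmetic on the grid. The variable names are illustrative.

```python
import math

# Same values as the XGBoost grid defined above.
param_grid = {
    "min_child_weight": [1, 5, 10],
    "gamma": [0.5, 1],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
    "max_depth": [5, 6, 7],
}
n_folds = 3
n_iter = 4

n_combinations = math.prod(len(values) for values in param_grid.values())
grid_search_fits = n_combinations * n_folds   # every combination on every fold
random_search_fits = n_iter * n_folds         # only the sampled combinations

print(n_combinations, grid_search_fits, random_search_fits)
```

Both searches add one final refit on the full training set. The price of the cheaper search is that only 4 of the 108 combinations are ever evaluated.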

Printing all scores for the estimator with the best ROC AUC.

In [None]:
def results_summary(classifier):
    roc_auc_results = classifier.cv_results_['mean_test_roc_auc']
    loc = np.where(roc_auc_results == np.amax(roc_auc_results))[0][0]

    rs_roc_auc = classifier.cv_results_['mean_test_roc_auc'][loc]
    rs_prec = classifier.cv_results_['mean_test_precision'][loc]
    rs_recall = classifier.cv_results_['mean_test_recall'][loc]
    rs_accur = classifier.cv_results_['mean_test_accuracy'][loc]

    print("ROC_AUC = {:.3f}".format(rs_roc_auc))
    print("Precision = {:.3f}".format(rs_prec))
    print("Recall = {:.3f}".format(rs_recall))
    print("Accuracy = {:.3f}".format(rs_accur))

    return [rs_roc_auc,rs_prec,rs_recall,rs_accur] # return array for the final summary

xgb_results = results_summary(random_search_1)
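
The row lookup inside `results_summary` can be written more compactly with `np.argmax`. The sketch below uses a toy dictionary in place of `cv_results_` (the scores are made up) to show that both expressions select the same row.

```python
import numpy as np

# Toy stand-in for `cv_results_`; the values are invented for illustration.
toy_cv_results = {
    "mean_test_roc_auc": np.array([0.81, 0.84, 0.79, 0.84]),
    "mean_test_accuracy": np.array([0.76, 0.78, 0.74, 0.77]),
}

scores = toy_cv_results["mean_test_roc_auc"]
loc_where = np.where(scores == np.amax(scores))[0][0]  # approach used in results_summary
loc_argmax = int(np.argmax(scores))                    # first maximum, also on ties

best_accuracy = toy_cv_results["mean_test_accuracy"][loc_argmax]
print(loc_where, loc_argmax, best_accuracy)
```

On a fitted search object with `refit='roc_auc'`, the `best_index_` attribute should point to the same row, so it could be used instead of recomputing the index.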

In [None]:
xgb_best = random_search_1.best_estimator_

In [None]:
from xgboost import plot_importance
plot_importance(xgb_best, max_num_features=15) # top 15 most important features
plt.show()

### 4.2 AdaBoost Classifier

AdaBoost Classifier sklearn API documentation: [LINK](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)

In [None]:
from sklearn.ensemble import AdaBoostClassifier

ada = AdaBoostClassifier()

params_2 = {
        'n_estimators': np.arange(100,1200,100),
        'learning_rate': np.arange(0.1,1.1,0.2),
        }

In [None]:
%%time

random_search_2 = RandomizedSearchCV(ada, param_distributions=params_2, n_iter=4, scoring=['roc_auc','accuracy','recall','precision'],
                                   n_jobs=-1, cv=skf.split(X_train,y_train), verbose=3, random_state=1001, refit='roc_auc')
random_search_2.fit(X_train, y_train)

ada_best = random_search_2.best_estimator_
ada_results = results_summary(random_search_2)

In [None]:
ada_best.feature_importances_[:10]

In [None]:
X_train.columns[:10]

### 4.3 Random Forest Classifier

Random Forest Classifier sklearn API documentation: [LINK](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html?highlight=random%20forest#sklearn.ensemble.RandomForestClassifier)

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_jobs=-1)

params_3 = {
        'n_estimators': np.arange(100,1000,100),
        'max_depth': np.arange(30,110,10),
        'bootstrap': [True, False]
        }

In [None]:
random_search_3 = RandomizedSearchCV(rfc, param_distributions=params_3, n_iter=8, scoring=['roc_auc','accuracy','recall','precision'],cv=skf.split(X_train,y_train), verbose=3, random_state=1001, refit='roc_auc',n_jobs=-1)
random_search_3.fit(X_train, y_train)

rfc_best = random_search_3.best_estimator_
rfc_results = results_summary(random_search_3)

### 4.4 Extra Trees Classifier

Extra Trees Classifier sklearn API documentation: [LINK](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html?highlight=extra%20tree%20classifier#sklearn.ensemble.ExtraTreesClassifier)

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

etc = ExtraTreesClassifier(n_jobs=-1)

params_4 = {
        'n_estimators': np.arange(20,1100,100),
        'max_depth': np.arange(30,130,10),
        'bootstrap': [True, False]
        }

In [None]:
%%time

random_search_4 = RandomizedSearchCV(etc, param_distributions=params_4, n_iter=8, scoring=['roc_auc','accuracy','recall','precision'],
                                     cv=skf.split(X_train,y_train), verbose=2, random_state=1001, refit='roc_auc', n_jobs=-1)
random_search_4.fit(X_train, y_train)

etc_results = results_summary(random_search_4)
etc_best = random_search_4.best_estimator_

<a id='Building ensembles'></a>
## 5. Building ensembles<a href='#Top'>^</a><br>  

First we will build simple voting ensembles. The idea is to let independently built models vote on the predicted labels. Like bagging, voting mainly aims to reduce variance (not bias), so it is a good idea if you suspect over-fitting. Let's investigate what the effect on our classifiers will be.
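
Soft voting, used by the ensembles in this section, averages the class probabilities predicted by each model and picks the class with the highest mean. A minimal sketch with made-up probabilities for two models and three tweets:

```python
import numpy as np

# Made-up class probabilities [P(fake), P(true)] for three tweets.
proba_model_a = np.array([[0.90, 0.10],
                          [0.40, 0.60],
                          [0.55, 0.45]])
proba_model_b = np.array([[0.60, 0.40],
                          [0.45, 0.55],
                          [0.30, 0.70]])

# Soft voting: average the probabilities, then take the most probable class.
mean_proba = (proba_model_a + proba_model_b) / 2
soft_labels = mean_proba.argmax(axis=1)

print(mean_proba)
print(soft_labels)
```

For the third tweet the two models disagree, and the more confident one decides the outcome. This is the main difference from hard (majority) voting, where each model contributes only its predicted label.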

Simple voting ensemble of boosted algorithms

In [None]:
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_validate

In [None]:
%%time

ensemble_1 = VotingClassifier(estimators=[("XGB",xgb_best),("ADA",ada_best)], voting="soft")

cv_results_1 = cross_validate(ensemble_1, X_train, y_train, cv=3, scoring=("roc_auc","precision","recall","accuracy"),
                              return_train_score=True)

print("ROC_AUC = {:.3f}".format(cv_results_1["test_roc_auc"].mean()))
print("Precision = {:.3f}".format(cv_results_1["test_precision"].mean()))
print("Recall = {:.3f}".format(cv_results_1["test_recall"].mean()))
print("Accuracy = {:.3f}".format(cv_results_1["test_accuracy"].mean()))

ens1_results =[cv_results_1["test_roc_auc"].mean(),cv_results_1["test_precision"].mean(),
              cv_results_1["test_recall"].mean(), cv_results_1["test_accuracy"].mean()]

In [None]:
from sklearn.tree import DecisionTreeClassifier

clf1 = XGBClassifier(learning_rate=0.02, objective='binary:logistic', verbosity=0)  # verbosity replaces the deprecated silent flag
clf2 = ExtraTreesClassifier()
clf3 = AdaBoostClassifier()

ens_2 = VotingClassifier(estimators = [('xgb', clf1), ('etc', clf2), ('ada', clf3)], voting = 'soft')

In [None]:
%%time

params = {'xgb__n_estimators': np.arange(500,1200,50), 'xgb__max_depth': [10,20,30],
         'etc__n_estimators': np.arange(500,2000,100), 'etc__max_depth': np.arange(10,130,10),
         'ada__n_estimators': np.arange(400,800,50), 'ada__learning_rate': [0.8,1.0]}

cv_results_2 = RandomizedSearchCV(estimator=ens_2, param_distributions=params, n_iter=128, scoring=['roc_auc','accuracy','recall','precision'],
                          cv=skf.split(X_train,y_train), verbose=3, random_state=10, refit='roc_auc', n_jobs=-1)

cv_results_2.fit(X_train, y_train)

ens2_results = results_summary(cv_results_2)

In [None]:
ens2_best = cv_results_2.best_estimator_

In [None]:
cv_results_2.best_params_

**SUMMARY**

Under construction....

In [None]:
summary = pd.DataFrame({"XGB":xgb_results,"ADA":ada_results,"RFC":rfc_results,"ETC":etc_results,
                        "Ens_1":ens1_results,"Ens_2":ens2_results},
                      index=["ROC_AUC","Precision","Recall","Accuracy"])
summary

## If you liked this notebook ---> **UPVOTE**.

In [None]:
predictions = ens2_best.predict(X_test)

In [None]:
output = pd.DataFrame({'id': test_data.id,
                       'target': predictions})
output.to_csv('submission.csv', index = False)
print('submission saved!')

In [None]:
output.head()