<a id='top'></a>
<h1 style="text-align:center;font-size:200%;">Real or Not? NLP with Disaster Tweets</h1>
![](https://st.depositphotos.com/1032753/4674/v/950/depositphotos_46741417-stock-illustration-twitter-and-social-media-concept.jpg)

# About competition: <br>
* Twitter has become an important communication channel in times of emergency. The ubiquity of smartphones enables people to announce an emergency they're observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter. <br>
* In this competition, we are challenged to build a machine learning model that predicts which tweets are about real disasters and which ones aren't. The competition provides a dataset of 10,000 hand-classified tweets and, for anyone working on an NLP problem for the first time, a quick tutorial to get up and running.<br>

<div class="list-group" id="list-tab" role="tablist">
  <h3 class="list-group-item list-group-item-action active" data-toggle="list"  role="tab" aria-controls="home">Notebook Content:</h3>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#Target Visualization" role="tab" aria-controls="profile">Part one: Target Visualization<span class="badge badge-primary badge-pill">1</span></a>
  <a class="list-group-item list-group-item-action" data-toggle="list" href="#word Embeddings" role="tab" aria-controls="messages">Part Two: Introduction to word Embeddings<span class="badge badge-primary badge-pill">2</span></a>
  <a class="list-group-item list-group-item-action"  data-toggle="list" href="#models" role="tab" aria-controls="settings">Part three: Building basic models and text preprocessing to improve score<span class="badge badge-primary badge-pill">3</span></a>
</div>
  

#### Credits and references: <br>
I learned the techniques used in this notebook from the following kernels: <br>
1. [Target Visualization - T-SNE and Doc2Vec](https://www.kaggle.com/arthurtok/target-visualization-t-sne-and-doc2vec) <br>
2. [A Detailed Explanation of Keras Embedding Layer](https://www.kaggle.com/rajmehra03/a-detailed-explanation-of-keras-embedding-layer) <br>
3. [A look at different embeddings.!](https://www.kaggle.com/sudalairajkumar/a-look-at-different-embeddings)<br>
4. [Improve your Score with Text Preprocessing -- V2](https://www.kaggle.com/theoviel/improve-your-score-with-text-preprocessing-v2)<br>
Thanks to the authors of the above kernels :)

<a id='Target Visualization'></a>
# <font color='red'> Part one: Target Visualization</font> <br>

In this part we explore the target variable and how it is distributed across the structure of the training data, to see whether any useful information or patterns can be gleaned going forward. Classical treatments of text data (term frequencies or term frequency-inverse document frequencies) normally come with the challenge of high dimensionality, so the plan in this kernel is to explore the target variable visually in a lower-dimensional space.

# 1. Importing necessary modules.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import gc

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.tokenize import word_tokenize, sent_tokenize, TweetTokenizer 
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
from string import punctuation

import re
from functools import reduce

import bokeh.plotting as bp
from bokeh.models import HoverTool, BoxSelectTool
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show, output_notebook, reset_output
from bokeh.palettes import d3
import bokeh.models as bmo
from bokeh.io import save, output_file

# init_notebook_mode(connected = True)
# color = sns.color_palette("Set2")
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999
pd.options.display.max_rows = 999

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# 2. Importing dataframes.

In [None]:
train_df = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test_df = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
sub = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')

In [None]:
train_df.head()

## Columns description
* id - a unique identifier for each tweet
* text - the text of the tweet
* location - the location the tweet was sent from (may be blank)
* keyword - a particular keyword from the tweet (may be blank)
* target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

In [None]:
train_df.target.value_counts()

# 3. Resampling the training data.

In [None]:
sample_size = 3271 

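# Rebalancing the training set: draw the same number of tweets from each class
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
train_rebal = pd.concat([train_df[train_df.target == 1].sample(sample_size),
                         train_df[train_df.target == 0].sample(sample_size)]).reset_index()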

# 4. Text processing.<br>
**In this section, we pre-process the text in the training data. The steps applied here are standard in text-based NLP problems and consist of:**

* Tokenization
* Stemming or Lemmatization

In [None]:
def remove_stopwords(words):
    """
    Function to remove stopwords from the text
    """
    stop_words = set(stopwords.words("english"))
    return [word for word in words if word not in stop_words]

def remove_punctuation(text):
    """
    Function to remove punctuation from the text
    """
    return re.sub(r'[^\w\s]', '', text)

def lemmatize_text(words):
    """
    Function to lemmatize the text
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words]

def stem_text(words):
    """
    Function to stem the text
    """
    ps = PorterStemmer()
    return [ps.stem(word) for word in words]

In [None]:
puncts=['☹', 'Ź', 'Ż', 'ἰ', 'ή', 'Š', '＞', 'ξ','ฉ', 'ั', 'น', 'จ', 'ะ', 'ท', 'ำ', 'ใ', 'ห', '้', 'ด', 'ี', '่', 'ส', 'ุ', 'Π', 'प', 'ऊ', 'Ö', 'خ', 'ب', 'ஜ', 'ோ', 'ட', '「', 'ẽ', '½', '△', 'É', 'ķ', 'ï', '¿', 'ł', '북', '한', '¼', '∆', '≥', '⇒', '¬', '∨', 'č', 'š', '∫', 'ḥ', 'ā', 'ī', 'Ñ', 'à', '▾', 'Ω', '＾', 'ý', 'µ', '?', '!', '.', ',', '"', '#', '$', '%', '\\', "'", '(', ')', '*', '+', '-', '/', ':', ';', '<', '=', '>', '@', '[', ']', '^', '_', '`', '{', '|', '}', '~', '“', '”', '’', 'é', 'á', '′', '…', 'ɾ', '̃', 'ɖ', 'ö', '–', '‘', 'ऋ', 'ॠ', 'ऌ', 'ॡ', 'ò', 'è', 'ù', 'â', 'ğ', 'म', 'ि', 'ल', 'ग', 'ई', 'क', 'े', 'ज', 'ो', 'ठ', 'ं', 'ड', 'Ž', 'ž', 'ó', '®', 'ê', 'ạ', 'ệ', '°', 'ص', 'و', 'ر', 'ü', '²', '₹', 'ú', '√', 'α', '→', 'ū', '—', '£', 'ä', '️', 'ø', '´', '×', 'í', 'ō', 'π', '÷', 'ʿ', '€', 'ñ', 'ç', 'へ', 'の', 'と', 'も', '↑', '∞', 'ʻ', '℅''ι', '•', 'ì', '−', 'л', 'я', 'д', 'ل', 'ك', 'م', 'ق', 'ا', '∈', '∩', '⊆', 'ã', 'अ', 'न', 'ु', 'स', '्', 'व', 'ा', 'र', 'त', '§', '℃', 'θ', '±', '≤', 'उ', 'द', 'य', 'ब', 'ट', '͡', '͜', 'ʖ', '⁴', '™', 'ć', 'ô', 'с', 'п', 'и', 'б', 'о', 'г', '≠', '∂', 'आ', 'ह', 'भ', 'ी', '³', 'च', '...', '⌚', '⟨', '⟩', '∖', '˂', 'ⁿ', '⅔', 'న', 'ీ', 'క', 'ె', 'ం', 'ద', 'ు', 'ా', 'గ', 'ర', 'ి', 'చ', 'র', 'ড়', 'ঢ়', 'સ', 'ં', 'ઘ', 'ર', 'ા', 'જ', '્', 'ય', 'ε', 'ν', 'τ', 'σ', 'ş', 'ś', 'س', 'ت', 'ط', 'ي', 'ع', 'ة', 'د', 'Å', '☺', 'ℇ', '❤', '♨', '✌', 'ﬁ', 'て', '„', 'Ā', 'ត', 'ើ', 'ប', 'ង', '្', 'អ', 'ូ', 'ន', 'ម', 'ា', 'ធ', 'យ', 'វ', 'ី', 'ខ', 'ល', 'ះ', 'ដ', 'រ', 'ក', 'ឃ', 'ញ', 'ឯ', 'ស', 'ំ', 'ព', 'ិ', 'ៃ', 'ទ', 'គ', '¢', 'つ', 'や', 'ค', 'ณ', 'ก', 'ล', 'ง', 'อ', 'ไ', 'ร', 'į', 'ی', 'ю', 'ʌ', 'ʊ', 'י', 'ה', 'ו', 'ד', 'ת', 'ᠠ', 'ᡳ', 'ᠰ', 'ᠨ', 'ᡤ', 'ᡠ', 'ᡵ', 'ṭ', 'ế', 'ध', 'ड़', 'ß', '¸', 'ч',  'ễ', 'ộ', 'फ', 'μ', '⧼', '⧽', 'ম', 'হ', 'া', 'ব', 'ি', 'শ', '্', 'প', 'ত', 'ন', 'য়', 'স', 'চ', 'ছ', 'ে', 'ষ', 'য', '়', 'ট', 'উ', 'থ', 'ক', 'ῥ', 'ζ', 'ὤ', 'Ü', 'Δ', '내', '제', 'ʃ', 'ɸ', 'ợ', 'ĺ', 'º', 'ष', '♭', '़', '✅', '✓', 'ě', '∘', '¨', '″', 'İ', '⃗', '̂', 'æ', 'ɔ', 
'∑', '¾', 'Я', 'х', 'О', 'з', 'ف', 'ن', 'ḵ', 'Č', 'П', 'ь', 'В', 'Φ', 'ỵ', 'ɦ', 'ʏ', 'ɨ', 'ɛ', 'ʀ', 'ċ', 'օ', 'ʍ', 'ռ', 'ք', 'ʋ', '兰', 'ϵ', 'δ', 'Ľ', 'ɒ', 'î', 'Ἀ', 'χ', 'ῆ', 'ύ', 'ኤ', 'ል', 'ሮ', 'ኢ', 'የ', 'ኝ', 'ን', 'አ', 'ሁ', '≅', 'ϕ', '‑', 'ả', '￼', 'ֿ', 'か', 'く', 'れ', 'ő', '－', 'ș', 'ן', 'Γ', '∪', 'φ', 'ψ', '⊨', 'β', '∠', 'Ó', '«', '»', 'Í', 'க', 'வ', 'ா', 'ம', '≈', '⁰', '⁷', 'ấ', 'ũ', '눈', '치', 'ụ', 'å', '،', '＝', '（', '）', 'ə', 'ਨ', 'ਾ', 'ਮ', 'ੁ', '︠', '︡', 'ɑ', 'ː', 'λ', '∧', '∀', 'Ō', 'ㅜ', 'Ο', 'ς', 'ο', 'η', 'Σ', 'ण']
odd_chars=[ '大','能', '化', '生', '水', '谷', '精', '微', 'ル', 'ー', 'ジ', 'ュ', '支', '那', '¹', 'マ', 'リ', '仲', '直', 'り', 'し', 'た', '主', '席', '血', '⅓', '漢', '髪', '金', '茶', '訓', '読', '黒', 'ř', 'あ', 'わ', 'る', '胡', '南', '수', '능', '广', '电', '总', 'ί', '서', '로', '가', '를', '행', '복', '하', '게', '기', '乡', '故', '爾', '汝', '言', '得', '理', '让', '骂', '野', '比', 'び', '太', '後', '宮', '甄', '嬛', '傳', '做', '莫', '你', '酱', '紫', '甲', '骨', '陳', '宗', '陈', '什', '么', '说', '伊', '藤', '長', 'ﷺ', '僕', 'だ', 'け', 'が', '街', '◦', '火', '团', '表',  '看', '他', '顺', '眼', '中', '華', '民', '國', '許', '自', '東', '儿', '臣', '惶', '恐', 'っ', '木', 'ホ', 'ج', '教', '官', '국', '고', '등', '학', '교', '는', '몇', '시', '간', '업', '니', '本', '語', '上', '手', 'で', 'ね', '台', '湾', '最', '美', '风', '景', 'Î', '≡', '皎', '滢', '杨', '∛', '簡', '訊', '短', '送', '發', 'お', '早', 'う', '朝', 'ش', 'ه', '饭', '乱', '吃', '话', '讲', '男', '女', '授', '受', '亲', '好', '心', '没', '报', '攻', '克', '禮', '儀', '統', '已', '經', '失', '存', '٨', '八', '‛', '字', '：', '别', '高', '兴', '还', '几', '个', '条', '件', '呢', '觀', '《', '》', '記', '宋', '楚', '瑜', '孫', '瀛', '枚', '无', '挑', '剔', '聖', '部', '頭', '合', '約', 'ρ', '油', '腻', '邋', '遢', 'ٌ', 'Ä', '射', '籍', '贯', '老', '常', '谈', '族', '伟', '复', '平', '天', '下', '悠', '堵', '阻', '愛', '过', '会', '俄', '罗', '斯', '茹', '西', '亚', '싱', '관', '없', '어', '나', '이', '키', '夢', '彩', '蛋', '鰹', '節', '狐', '狸', '鳳', '凰', '露', '王', '晓', '菲', '恋', 'に', '落', 'ち', 'ら', 'よ', '悲', '反', '清', '復', '明', '肉', '希', '望', '沒', '公', '病', '配', '信', '開', '始', '日', '商', '品', '発', '売', '分', '子', '创', '意', '梦', '工', '坊', 'ک', 'پ', 'ڤ', '蘭', '花', '羡', '慕', '和', '嫉', '妒', '是', '样', 'ご', 'め', 'な', 'さ', 'い', 'す', 'み', 'ま', 'せ', 'ん', '音', '红', '宝', '书', '封', '柏', '荣', '江', '青', '鸡', '汤', '文', '粵', '拼', '寧', '可', '錯', '殺', '千', '絕', '放', '過', '」', '之', '勢', '请', '国', '知', '识', '产', '权', '局', '標', '點', '符', '號', '新', '年', '快', '乐', '学', '业', '进', '步', '身', '体', '健', '康', '们', '读', '我', '的', '翻', '译', '篇', '章', '欢', '迎', '入', '坑', '有', '毒', '黎', '氏', '玉', '英', '啧', '您', '这', '口', '味', '奇', '特', '也', '就', '罢', '了', 
'非', '要', '以', '此', '为', '依', '据', '对', '人', '家', '批', '判', '一', '番', '不', '地', '道', '啊', '谢', '六', '佬']
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "can't've": "cannot have", "'cause": "because", "could've": "could have", "couldn't": "could not", "couldn't've": "could not have","didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not",  "he'd": "he would", "he'd've": "he would have", "he'll": "he will", "he'll've": "he will have", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will", "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not","sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have","that's": "that is", "there'd": "there would", "there'd've": "there would have","there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",
"wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are", "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }

In [None]:
def clean_numbers(x):
    x = re.sub('[0-9]{5,}', ' ##### ', x)
    x = re.sub('[0-9]{4}', ' #### ', x)
    x = re.sub('[0-9]{3}', ' ### ', x)
    x = re.sub('[0-9]{2}', ' ## ', x)
    return x

def punct_add_space(x):
    x = str(x)
    for punct in puncts:
        x = x.replace(punct, f' {punct} ')
    return x  

def odd_add_space(x):
    x = str(x)
    for odd in odd_chars:
        x = x.replace(odd, f' {odd} ')
    return x 

def clean_contractions(text, mapping):
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text

In [None]:
# Expand contractions first: punct_add_space() puts spaces around apostrophes,
# after which tokens such as "can't" would no longer match the mapping.
train_rebal["text"] = train_rebal["text"].apply(lambda x: clean_contractions(x, contraction_mapping))
train_rebal["text"] = train_rebal["text"].apply(lambda x: clean_numbers(x))
train_rebal["text"] = train_rebal["text"].apply(lambda x: punct_add_space(x))
train_rebal["text"] = train_rebal["text"].apply(lambda x: odd_add_space(x))

In [None]:
train_rebal["text"] = train_rebal["text"].apply(lambda x: word_tokenize(x))

In [None]:
train_rebal.head()

# 5. T-SNE applied to Latent Semantic (LSA) space
* To start off, we look at the sparse representation of text documents produced by the term frequency-inverse document frequency (TF-IDF) method. It builds a matrix that upweights terms that are frequent within a document but rare across the corpus, which corrects for the bias toward common words that raw term frequencies have.
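
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand on a three-tweet toy corpus. The corpus and the helper name `tf_idf_weights` are made up for illustration, and the formula is the plain textbook `tf * log(N / df)`. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalises each row, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from the competition data)
toy_docs = [
    "fire near the forest",
    "fire crews on the way",
    "the concert was fire",
]

def tf_idf_weights(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: (c / len(tokens)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return weights

toy_weights = tf_idf_weights(toy_docs)
for doc, w in zip(toy_docs, toy_weights):
    print(doc, "->", {t: round(v, 3) for t, v in w.items()})
```

Words that occur in every toy document ("fire", "the") get a weight of zero, while a word that occurs in only one document ("forest") gets a positive weight.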

In [None]:
tf_idf_vec = TfidfVectorizer(min_df=3,
                             max_features = None, 
                             analyzer="word",
                             ngram_range=(1,3), # (1,6)
                             stop_words="english")
tf_idf = tf_idf_vec.fit_transform(list(train_rebal["text"].map(lambda tokens: " ".join(tokens))))

* Having obtained our TF-IDF matrix, a sparse matrix object, we now apply TruncatedSVD to reduce its dimensionality to a decomposed feature space. This combination is known as LSA (Latent Semantic Analysis).

* LSA is one of the classical methods for text. It allows "concept" searching: words that are semantically similar (i.e. that tend to appear in similar contexts) lie close to each other in this space, and dissimilar words lie far apart.
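
The idea can be sketched on a tiny hand-made count matrix. This is an illustration only: the matrix values are hypothetical, and it uses NumPy's exact SVD followed by manual truncation rather than scikit-learn's `TruncatedSVD`, which relies on an approximate solver for large sparse matrices.

```python
import numpy as np

# Hypothetical term-document counts: rows are documents, columns are terms
terms = ["flood", "water", "rain", "concert", "music"]
X = np.array([
    [2, 1, 0, 0, 0],   # doc 0: flood, water
    [1, 2, 1, 0, 0],   # doc 1: flood, water, rain
    [0, 0, 0, 2, 1],   # doc 2: concert, music
    [0, 0, 0, 1, 2],   # doc 3: concert, music
], dtype=float)

# Full SVD, then keep only the k strongest components (the "truncation")
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
docs_lsa = U[:, :k] * S[:k]      # documents in the k-dimensional concept space
terms_lsa = Vt[:k, :].T          # terms in the same space

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("docs_lsa shape:", docs_lsa.shape)
print("doc 0 vs doc 1:", round(cosine(docs_lsa[0], docs_lsa[1]), 3))
print("doc 0 vs doc 2:", round(cosine(docs_lsa[0], docs_lsa[2]), 3))
```

The two weather-related documents end up pointing in the same direction of the reduced space, while a weather document and a music document are orthogonal.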

# 6. Scatter plots of the Latent Semantic Space

## 6.1 First 3 dimensions of the Latent Semantic Space

In [None]:
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=50, random_state=2020)
svd_tfidf = svd.fit_transform(tf_idf)
print("Dimensionality of LSA space: {}".format(svd_tfidf.shape))

In [None]:
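from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib versions
fig = plt.figure(figsize=(16,12))

# Plot models:
# add_subplot attaches the 3D axes to the figure; a bare Axes3D(fig) is not attached in recent matplotlib
ax = fig.add_subplot(111, projection="3d")
sc = ax.scatter(svd_tfidf[:,0],
                svd_tfidf[:,1],
                svd_tfidf[:,2],
                c=train_rebal.target.values,
                cmap=plt.cm.winter_r,
                s=20,
                edgecolor='none',
                marker='o')
ax.set_title("Semantic Tf-Idf-SVD reduced plot of real-Not real data distribution")
ax.set_xlabel("First dimension")
ax.set_ylabel("Second dimension")
ax.set_zlabel("Third dimension")
fig.colorbar(sc, ax=ax, shrink=0.6, label="target")  # replaces plt.legend(), which had no labelled artists to show
ax.set_xlim(0.0, 0.4)
ax.set_ylim(-0.2, 0.4)
plt.show()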

## 6.2 Random 3 dimensions of the Latent Semantic Space

In [None]:
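fig = plt.figure(figsize=(16,12))

# Plot models:
# add_subplot attaches the 3D axes to the figure; a bare Axes3D(fig) is not attached in recent matplotlib
ax = fig.add_subplot(111, projection="3d")
sc = ax.scatter(svd_tfidf[:,20],
                svd_tfidf[:,21],
                svd_tfidf[:,22],
                c=train_rebal.target.values,
                cmap=plt.cm.winter_r,
                s=20,
                edgecolor='none',
                marker='o')
ax.set_title("Semantic Tf-Idf-SVD reduced plot of real-Not real data distribution")
ax.set_xlabel("Dimension 21")
ax.set_ylabel("Dimension 22")
ax.set_zlabel("Dimension 23")
fig.colorbar(sc, ax=ax, shrink=0.6, label="target")  # replaces plt.legend(), which had no labelled artists to show
ax.set_xlim(-0.4, 0.4)
ax.set_ylim(-0.3, 0.4)
plt.show()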

## 6.3 Last 3 dimensions of the Latent Semantic Space

In [None]:
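fig = plt.figure(figsize=(16,12))

# Plot models:
# add_subplot attaches the 3D axes to the figure; a bare Axes3D(fig) is not attached in recent matplotlib
ax = fig.add_subplot(111, projection="3d")
sc = ax.scatter(svd_tfidf[:,47],
                svd_tfidf[:,48],
                svd_tfidf[:,49],
                c=train_rebal.target.values,
                cmap=plt.cm.winter_r,
                s=20,
                edgecolor='none',
                marker='x')
ax.set_title("Semantic Tf-Idf-SVD reduced plot of real-Not real data distribution")
ax.set_xlabel("Dimension 48")
ax.set_ylabel("Dimension 49")
ax.set_zlabel("Dimension 50")
fig.colorbar(sc, ax=ax, shrink=0.6, label="target")  # replaces plt.legend(), which had no labelled artists to show
ax.set_xlim(-0.2, 0.6)
ax.set_ylim(-0.2, 0.2)
plt.show()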

## Observations: <br>
* From the above scatter plots, it is apparent that real disaster tweets and non-disaster tweets overlap quite significantly in the LSA semantic space. <br>
* Also, there does not seem to be any clear or obvious pattern that separates the class labels. <br>

# 7. Applying T-SNE (a non-linear method) to the LSA-reduced space

In [None]:
!pip install MulticoreTSNE

In [None]:
from MulticoreTSNE import MulticoreTSNE as TSNE
tsne_model = TSNE(n_jobs=4,
                  early_exaggeration=4, # Trying out exaggeration trick
                  n_components=2,
                  verbose=1,
                  random_state=2020,
                  n_iter=500)
tsne_tfidf = tsne_model.fit_transform(svd_tfidf)

In [None]:
tsne_tfidf_df = pd.DataFrame(data=tsne_tfidf, columns=["x", "y"])
tsne_tfidf_df["id"] = train_rebal["id"].values
tsne_tfidf_df["text"] = train_rebal["text"].values
tsne_tfidf_df["target"] = train_rebal["target"].values

In [None]:
output_notebook()
plot_tfidf = bp.figure(plot_width = 600, plot_height = 600, 
                       title = "T-SNE applied to Tfidf_SVD space",
                       tools = "pan, wheel_zoom, box_zoom, reset, hover, previewsave",
                       x_axis_type = None, y_axis_type = None, min_border = 1)

# colormap = np.array(["#6d8dca", "#d07d3c"])
colormap = np.array(["darkblue", "red"])

# palette = d3["Category10"][len(tsne_tfidf_df["asset_name"].unique())]
source = ColumnDataSource(data = dict(x = tsne_tfidf_df["x"], 
                                      y = tsne_tfidf_df["y"],
                                      color = colormap[tsne_tfidf_df["target"]],
                                      text = tsne_tfidf_df["text"],
                                      id = tsne_tfidf_df["id"],
                                      target = tsne_tfidf_df["target"]))

plot_tfidf.scatter(x = "x", 
                   y = "y", 
                   color="color",
                   legend = "target",
                   source = source,
                   alpha = 1)
hover = plot_tfidf.select(dict(type = HoverTool))
hover.tooltips = {"id": "@id", 
                  "text": "@text", 
                  "target":"@target"}

show(plot_tfidf)

## Observations:<br>
* The real disaster tweets and non-disaster tweets overlap in certain regions of the T-SNE plot in concept space, which makes it hard to tell the two classes apart visually. <br>
* This raises the question of how easy it is for a human to distinguish real disaster tweets from non-disaster tweets when data from both class labels overlap so heavily. <br>

# 8. Visualising T-SNE applied to LSA reduced space by changing Perplexity.
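
As background for the cells below: in T-SNE, perplexity is defined as 2 raised to the Shannon entropy (in bits) of the neighbour distribution around each point, and it is commonly read as the effective number of neighbours each point pays attention to. The sketch below only illustrates that definition on two made-up distributions; it is not part of the T-SNE fitting itself, and the function name `perplexity` is chosen here for illustration.

```python
import math

def perplexity(probs):
    """Perplexity 2**H(p) of a discrete distribution, with H in bits."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

# Hypothetical neighbour distributions for a single point
uniform_over_5 = [0.2] * 5                       # five equally likely neighbours
peaked = [0.9, 0.025, 0.025, 0.025, 0.025]       # one dominant neighbour

print("uniform:", perplexity(uniform_over_5))
print("peaked: ", perplexity(peaked))
```

A uniform distribution over five neighbours has a perplexity of five, while the peaked one behaves as if the point had fewer than two neighbours. A low perplexity setting therefore emphasises very local structure, and a higher one takes more of the surrounding points into account.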

In [None]:
tsne_model_5 = TSNE(n_jobs=4, 
                    early_exaggeration=4,
                  perplexity=5,
                  n_components=2,
                  verbose=1,
                  random_state=2020,
                  n_iter=500)
tsne_tfidf_5 = tsne_model_5.fit_transform(svd_tfidf[:6542,:])
# Creating a Dataframe for Perplexity=5
tsne_tfidf_df_5 = pd.DataFrame(data=tsne_tfidf_5, columns=["x5", "y5"])
tsne_tfidf_df_5["target"] = train_rebal["target"][:6542].values

In [None]:
tsne_model_25 = TSNE(n_jobs=4, 
                     early_exaggeration=4,
                  perplexity=25,
                  n_components=2,
                  verbose=1,
                  random_state=2020,
                  n_iter=500)
tsne_tfidf_25 = tsne_model_25.fit_transform(svd_tfidf[:6542,:])
# Creating a Dataframe for Perplexity=25
tsne_tfidf_df_25 = pd.DataFrame(data=tsne_tfidf_25, 
                             columns=["x25", "y25"])
tsne_tfidf_df_25["target"] = train_rebal["target"][:6542].values

In [None]:
tsne_model_50 = TSNE(n_jobs=4, 
                     early_exaggeration=4,
                  perplexity=50,
                  n_components=2,
                  verbose=1,
                  random_state=2020,
                  n_iter=500)
tsne_tfidf_50 = tsne_model_50.fit_transform(svd_tfidf[:6542,:])
# Creating a Dataframe for Perplexity=50
tsne_tfidf_df_50 = pd.DataFrame(data=tsne_tfidf_50, 
                                columns=["x50", "y50"])
tsne_tfidf_df_50["target"] = train_rebal["target"][:6542].values

In [None]:
plt.figure(figsize=(14,8))
plt.scatter(tsne_tfidf_df_5.x5, 
            tsne_tfidf_df_5.y5, 
            alpha=0.75,
            c=tsne_tfidf_df_5.target,
            cmap=plt.cm.coolwarm)
plt.title("T-SNE plot in SVD space (perplexity=5)")
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(14,8))
plt.scatter(tsne_tfidf_df_25.x25, 
            tsne_tfidf_df_25.y25, 
            c=tsne_tfidf_df_25.target,
            cmap=plt.cm.coolwarm)
plt.title("T-SNE plot in SVD space (perplexity=25)")
plt.legend()
plt.show()

In [None]:
plt.figure(figsize=(14,8))
plt.scatter(tsne_tfidf_df_50.x50, 
            tsne_tfidf_df_50.y50, 
            c=tsne_tfidf_df_50.target,
            cmap=plt.cm.coolwarm)
plt.title("T-SNE plot in SVD space (perplexity=50)")
plt.legend()
plt.show()

## Observations: <br>
* Here too, the real disaster tweets and non-disaster tweets overlap in certain regions of the T-SNE plots at every perplexity tried, so the two classes still cannot be told apart visually.

# 9. T-SNE applied on Doc2Vec embedding<br>
* Moving forward with our T-SNE visual explorations, we next move away from semantic matrices into the realm of embeddings. Here we use the Doc2Vec algorithm which, much like its well-known counterpart Word2vec, learns continuous representations of text in an unsupervised way. Whereas Word2vec learns representations for words (i.e. word embeddings), Doc2vec extends the method to sentences and even whole documents.<br>

In [None]:
from gensim.test.utils import common_texts
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
texts = list(train_rebal["text"])

# Creating a list of terms and a list of labels to go with it
documents = [TaggedDocument(doc, tags=[str(i)]) for i, doc in enumerate(texts)]

In [None]:
max_epochs = 100
alpha=0.025
model = Doc2Vec(documents,
                vector_size=10,  # this argument was called `size` in older gensim releases
                min_alpha=0.00025,
                alpha=alpha,
                min_count=1,
                workers=4)

## Fitting a T-SNE model to the dense embeddings and overlaying that with the target visuals, we get:<br>

In [None]:
tsne_model = TSNE(n_jobs=4,
                  early_exaggeration=4,
                  n_components=2,
                  verbose=1,
                  random_state=2020,
                  n_iter=300)
tsne_d2v = tsne_model.fit_transform(model.docvecs.vectors_docs)

# Putting the T-SNE coordinates into a DataFrame
tsne_d2v_df = pd.DataFrame(data=tsne_d2v, columns=["x", "y"])
tsne_d2v_df["id"] = train_rebal["id"].values
tsne_d2v_df["text"] = train_rebal["text"].values
tsne_d2v_df["target"] = train_rebal["target"].values

In [None]:
output_notebook()
plot_d2v = bp.figure(plot_width = 500, plot_height = 500, 
                       title = "T-SNE applied to Doc2vec document embeddings",
                       tools = "pan, wheel_zoom, box_zoom, reset, hover, previewsave",
                       x_axis_type = None, y_axis_type = None, min_border = 1)

colormap = np.array(["darkblue", "cyan"])

source = ColumnDataSource(data = dict(x = tsne_d2v_df["x"], 
                                      y = tsne_d2v_df["y"],
                                      color = colormap[tsne_d2v_df["target"]],
                                      text = tsne_d2v_df["text"],
                                      id = tsne_d2v_df["id"],
                                      target = tsne_d2v_df["target"]))

plot_d2v.scatter(x = "x", 
                   y = "y", 
                   color="color",
                   legend = "target",
                   source = source,
                   alpha = 1.0)
hover = plot_d2v.select(dict(type = HoverTool))
hover.tooltips = {"id": "@id", 
                  "text": "@text", 
                  "target":"@target"}

show(plot_d2v)

## Observations: <br>
* In the Doc2Vec plot, the visual overlap between real disaster tweets and non-disaster tweets is high in some regions and lower in others.<br>
* Still, we cannot segregate the labels by eye. <br>

<a id='word Embeddings'></a>
# <font color='blue'>Part Two: Introduction to word Embeddings</font>

## 2.1 What are word embeddings?<br>
* A word embedding is a learned representation for text where words that have a similar meaning have a similar representation.<br>
* In very simplistic terms, word embeddings are texts converted into numbers, and the same text can have different numerical representations. <br>
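
"Similar representation" is usually measured with cosine similarity between word vectors. The sketch below uses three hand-made vectors to show the idea; the numbers are invented for illustration and are not learned from any data.

```python
import math

# Hand-made 3-dimensional vectors (hypothetical values, not learned embeddings)
toy_embeddings = {
    "fire":    [0.9, 0.8, 0.1],
    "blaze":   [0.85, 0.75, 0.2],
    "concert": [0.1, 0.2, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_close = cosine_similarity(toy_embeddings["fire"], toy_embeddings["blaze"])
sim_far = cosine_similarity(toy_embeddings["fire"], toy_embeddings["concert"])
print("fire vs blaze:  ", round(sim_close, 3))
print("fire vs concert:", round(sim_far, 3))
```

With real embeddings the vectors have tens or hundreds of dimensions, but the comparison works the same way: related words score close to 1, unrelated words score much lower.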

## 2.2 Why do we need word embeddings?
* As it turns out, many machine learning algorithms and almost all deep learning architectures cannot process strings or plain text in raw form. They require numbers as input, whether the task is classification, regression or something else. And with the huge amount of data that exists as text, it is imperative to extract knowledge from it and build applications.<br>

## 2.3 Different types of word embeddings <br>
* The different types of word embeddings can be broadly classified into two categories- <br>

1. Frequency based Embedding <br>
  1.1 Count Vector <br>
  1.2 Tf-IDF Vector <br>
  1.3 Co-Occurrence Vector <br>
  
2. Prediction based Embedding <br>
#### [Read more](https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/)
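
As a small illustration of the frequency-based family, the sketch below builds count vectors and adjacent-word co-occurrence counts for two made-up sentences using only the standard library. The sentences, the window size of 1 and the variable names are assumptions chosen for the example.

```python
from collections import Counter

# Two hypothetical documents
toy_docs = ["forest fire spreads fast", "fire crews fight forest fire"]
tokenized = [d.split() for d in toy_docs]
vocab = sorted({w for tokens in tokenized for w in tokens})

# Count vectors: one row per document, one column per vocabulary word
count_vectors = [[Counter(tokens)[w] for w in vocab] for tokens in tokenized]

# Co-occurrence counts within a window of 1 (adjacent words only);
# each pair is stored in sorted order so that (a, b) and (b, a) are merged
cooc = Counter()
for tokens in tokenized:
    for left, right in zip(tokens, tokens[1:]):
        cooc[tuple(sorted((left, right)))] += 1

print("vocab:", vocab)
for doc, vec in zip(toy_docs, count_vectors):
    print(doc, "->", vec)
print("co-occurrences:", dict(cooc))
```

A TF-IDF vector starts from the same counts and reweights them by how rare each word is across the corpus.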

## 2.4 Word Embedding Algorithms <br>
* Word embedding methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text. The learning process is either joint with the neural network model on some task, such as document classification, or is an unsupervised process using document statistics. <br>

### 2.4.1 Word Embedding layer <br>
* A word embedding layer is learned jointly with a neural network model on a specific natural language processing task, such as classification or text generation. <br>
* It requires that text be cleaned and prepared such that each word is one-hot encoded. The size of the vector space is specified as part of the model, such as 50, 100, or 300 dimensions. The vectors are initialized with small random numbers. The embedding layer is used on the front end of a neural network and is fit in a supervised way using the Backpropagation algorithm. <br>
* The one-hot encoded words are mapped to the word vectors. If a multilayer Perceptron model is used, then the word vectors are concatenated before being fed as input to the model. If a recurrent neural network is used, then each word may be taken as one input in a sequence. <br>
* [Read more here](https://machinelearningmastery.com/what-are-word-embeddings/)
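
The mapping from one-hot encoded words to word vectors is just a row lookup in the embedding matrix. The plain-Python sketch below illustrates this under assumed sizes (a vocabulary of 5 words and 3 dimensions). It is not how Keras implements the layer internally, and the training step via backpropagation is left out.

```python
import random

random.seed(0)
vocab_size, embed_dim = 5, 3

# Embedding matrix initialised with small random numbers, one row per word index
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)]
                    for _ in range(vocab_size)]

def one_hot_vector(index, size):
    """Vector of zeros with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec_lookup(one_hot, matrix):
    """Multiply a one-hot row vector by the matrix (the 'mapping' step)."""
    return [sum(one_hot[i] * matrix[i][j] for i in range(len(matrix)))
            for j in range(len(matrix[0]))]

word_indices = [2, 0, 4]   # a hypothetical three-word sentence
vectors = [matvec_lookup(one_hot_vector(i, vocab_size), embedding_matrix)
           for i in word_indices]

# MLP-style input: concatenate the word vectors into one flat vector
mlp_input = [x for vec in vectors for x in vec]

print("vector for word 2:", vectors[0])
print("length of MLP input:", len(mlp_input))
```

Multiplying by a one-hot vector returns exactly row 2 of the matrix, which is why embedding layers store integer indices and look rows up directly instead of materialising the one-hot vectors. For a recurrent network, `vectors` would be fed one word at a time instead of being concatenated.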

## 2.5 Training our own embedding layer in keras <br>

In [None]:
import nltk

#stop-words
from nltk.corpus import stopwords
stop_words=set(nltk.corpus.stopwords.words('english'))

# tokenizing
from nltk import word_tokenize,sent_tokenize

import math
from sklearn.model_selection import train_test_split
from sklearn import metrics


#keras
import keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import one_hot,Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense , Flatten ,Embedding,Input
from tensorflow.keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation, GRU, Conv1D,Lambda
from tensorflow.keras.layers import Bidirectional, GlobalMaxPool1D, GlobalMaxPooling1D, GlobalAveragePooling1D
from tensorflow.keras.layers import Input, Embedding, Dense, Conv2D, MaxPool2D, concatenate
from tensorflow.keras.layers import Reshape, Flatten, Concatenate, Dropout, SpatialDropout1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer  # keras.engine.topology is not available in recent Keras releases
from tensorflow.keras import initializers, regularizers, constraints, optimizers, layers
from tensorflow.keras.layers import concatenate
from tensorflow.keras.callbacks import *
# custom function for f1 score
def f1(y_true, y_pred):
    def recall(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
        recall = true_positives / (possible_positives + K.epsilon())
        return recall

    def precision(y_true, y_pred):
        true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
        predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
        precision = true_positives / (predicted_positives + K.epsilon())
        return precision
    
    precision = precision(y_true, y_pred)
    recall = recall(y_true, y_pred)
    return 2*((precision*recall)/(precision+recall+K.epsilon()))
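
As a sanity check on the arithmetic, the same quantity can be computed in plain Python on hand-made labels. This is a sketch with hypothetical data. The Keras metric above differs in that it works on tensors, rounds predicted probabilities to 0 or 1, adds a small epsilon to avoid division by zero, and is evaluated batch by batch.

```python
def f1_from_labels(y_true, y_pred):
    """F1 score for binary 0/1 labels, computed from TP, FP and FN counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 1, 1, 0, 0, 0]
y_pred_toy = [1, 1, 0, 1, 0, 0]
toy_f1 = f1_from_labels(y_true_toy, y_pred_toy)
print("precision = recall = 2/3, F1 =", round(toy_f1, 4))
```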

### Taking our sample text corpus

In [None]:
sample_text_1="Kaggle, a subsidiary of Google LLC, is an online community of Data scientists and machine learning practitioners"
sample_text_2="Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data"
corp=[sample_text_1,sample_text_2]
no_docs=len(corp)

### INTEGER ENCODING ALL THE DOCUMENTS
* After this step, every unique word is represented by an integer. We use the `one_hot` function from Keras, which assigns each word an index by hashing it. A larger `vocab_size` makes two words less likely to share an index, but it does not guarantee a unique integer for every word.

* Note that the integer encoding for a word stays the same across documents, e.g. 'Data' is denoted by 21 in every document.
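
The idea behind hashing-based integer encoding can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Keras implementation: `hash_encode` is a hypothetical helper, it uses md5 so the result is stable between runs, and it only lower-cases and splits on whitespace.

```python
import hashlib

def hash_encode(text, n):
    """Map each lower-cased word to an integer in [1, n-1] with a stable hash."""
    words = text.lower().split()
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
            for w in words]

doc_a = "data science uses data"
doc_b = "kaggle hosts data"
enc_a = hash_encode(doc_a, 50)
enc_b = hash_encode(doc_b, 50)
print(enc_a, enc_b)

# Count how many distinct words ended up sharing an index with another word.
words = set((doc_a + " " + doc_b).split())
codes = {w: hash_encode(w, 50)[0] for w in words}
collisions = len(words) - len(set(codes.values()))
print("collisions:", collisions)
```

The same word always gets the same index, in any document. Distinct words can still collide, and this becomes more likely as the number of distinct words approaches `n`.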

In [None]:
vocab_size=50 
encod_corp=[]
for i,doc in enumerate(corp):
    encoded=one_hot(doc,vocab_size)  # encode once and reuse the result
    encod_corp.append(encoded)
    print("The encoding for document",i+1," is : ",encoded)

### PADDING THE DOCS (to make every doc the same length): <br>
* The Keras Embedding layer requires all documents to have the same length, so we pad the shorter documents with 0. The `input_length` of the Embedding layer is then the length (i.e. the number of words) of the longest document.

* To pad the shorter documents, I am using the `pad_sequences` function from the Keras library.
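
As a rough picture of what post-padding does, here is a small plain-Python sketch. `pad_post` is a hypothetical helper, not the Keras function. One difference to keep in mind: this sketch truncates over-long sequences at the end, while `pad_sequences` by default drops items from the start unless `truncating='post'` is passed.

```python
def pad_post(seqs, maxlen, value=0):
    """Right-pad (or cut at the end) each sequence to exactly maxlen items."""
    padded = []
    for s in seqs:
        s = list(s[:maxlen])                       # keep at most maxlen items
        padded.append(s + [value] * (maxlen - len(s)))
    return padded

padded = pad_post([[5, 12, 7], [3, 9, 21, 4, 8]], maxlen=5)
print(padded)
```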

In [None]:
maxlen=-1
for doc in corp:
    tokens=nltk.word_tokenize(doc)
    if(maxlen<len(tokens)):
        maxlen=len(tokens)
print("The maximum number of words in any document is : ",maxlen)

In [None]:
# now to create embeddings all of our docs need to be of same length. hence we can pad the docs with zeros.
pad_corp=pad_sequences(encod_corp,maxlen=maxlen,padding='post',value=0.0)
print("No of padded documents: ",len(pad_corp))

In [None]:
for i,doc in enumerate(pad_corp):
     print("The padded encoding for document",i+1," is : ",doc)

### CREATING THE EMBEDDINGS USING KERAS EMBEDDING LAYER <br>

In [None]:
doc_input=Input(shape=(no_docs,maxlen),dtype='float64')  # whole-corpus input tensor; not used by the model below

In [None]:
word_input=Input(shape=(maxlen,),dtype='float64')  

# creating the embedding
word_embedding=Embedding(input_dim=vocab_size,output_dim=8,input_length=maxlen)(word_input)

word_vec=Flatten()(word_embedding) # flatten
embed_model =Model([word_input],word_vec) # combining all into a Keras model

In [None]:
embed_model.compile(optimizer=Adam(learning_rate=1e-3),loss='binary_crossentropy',metrics=['acc']) 

In [None]:
print(type(word_embedding))
print(word_embedding)

In [None]:
print(embed_model.summary())

In [None]:
embeddings=embed_model.predict(pad_corp)

In [None]:
print("Shape of embeddings : ",embeddings.shape)
print(embeddings)

In [None]:
embeddings=embeddings.reshape(-1,maxlen,8)
print("Shape of embeddings : ",embeddings.shape) 
print(embeddings)

The resulting shape is (2, 34, 8):

* 2: the number of documents
* 34: the number of words in each document after padding, which was the maximum length of any document
* 8: the dimension of each word vector
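
An embedding layer is essentially a lookup table with one row per integer id, which explains the three-dimensional shape. The sketch below imitates that lookup with plain lists. All sizes and ids are made up for illustration, and the table is filled with random numbers, much like the untrained layer above, whose weights are still at their random initial values because `embed_model` is never fitted.

```python
import random

random.seed(0)
vocab_size, dim = 50, 8

# One row of `dim` numbers per integer id; training would adjust these values.
table = [[random.uniform(-0.05, 0.05) for _ in range(dim)]
         for _ in range(vocab_size)]

# Two padded documents of length 4 (toy ids; 0 is the padding id).
padded_docs = [[21, 7, 0, 0], [3, 21, 9, 14]]

# Replace every id by its row in the table.
embedded = [[table[idx] for idx in doc] for doc in padded_docs]

shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
print(shape)
```

The id 21 appears in both documents and is mapped to the same vector each time. The padding id 0 also has its own row in the table.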

### GETTING ENCODING FOR A PARTICULAR WORD IN A SPECIFIC TEXT <br>

In [None]:
for i,doc in enumerate(embeddings):
    for j,word in enumerate(doc):
        print("The encoding for ",j+1,"th word","in",i+1,"th document is : \n\n",word)

<a id='models'></a>
# <font color='orange'> Part three: Building basic models and text preprocessing to improve score</font>  

* In this part we will build some basic models with simple architectures. We will also explore text preprocessing techniques for use with word embeddings.<br>
* We will then compare the results of a model built on preprocessed text with those of a model built on raw, uncleaned text.

### Preparing data:

In [None]:
train_df, val_df = train_test_split(train_df, test_size=0.1, random_state=2020)

## some config values 
embed_size = 128 # how big is each word vector
max_features = 20000 # how many unique words to use (i.e. num rows in the embedding matrix)
maxlen = 100 # max number of words in a tweet to use

train_X = train_df["text"].values
val_X = val_df["text"].values
test_X = test_df["text"].values

## Tokenize the sentences
tokenizer = Tokenizer(num_words=max_features)
tokenizer.fit_on_texts(list(train_X))
train_X = tokenizer.texts_to_sequences(train_X)
val_X = tokenizer.texts_to_sequences(val_X)
test_X = tokenizer.texts_to_sequences(test_X)

## Pad the sentences 
train_X = pad_sequences(train_X, maxlen=maxlen)
val_X = pad_sequences(val_X, maxlen=maxlen)
test_X = pad_sequences(test_X, maxlen=maxlen)

## Get the target values
train_y = train_df['target'].values
val_y = val_df['target'].values
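
The tokenizing step can be pictured with a simplified plain-Python sketch: words are ranked by frequency on the training texts, and at prediction time only words with an index below `num_words` are kept, while unseen words are skipped. The helpers `fit_word_index` and `texts_to_ids` are hypothetical names; the real `Tokenizer` additionally strips punctuation by default.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_ids(texts, word_index, num_words):
    """Keep only known words whose index is below num_words."""
    return [[word_index[w] for w in t.lower().split()
             if w in word_index and word_index[w] < num_words]
            for t in texts]

train_texts = ["fire in the forest", "the forest fire spreads", "the flood"]
word_index = fit_word_index(train_texts)
ids = texts_to_ids(["the fire spreads quickly"], word_index, num_words=4)
print(word_index)
print(ids)
```

Fitting only on the training texts, as the cell above does, means that validation and test words never seen in training are dropped rather than given an index.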

### Building model without pretrained embeddings

In [None]:
inp = Input(shape=(maxlen,))
x = Embedding(max_features, embed_size)(inp)
x = Bidirectional(GRU(64, return_sequences=True))(x)
x = GlobalMaxPool1D()(x)
x = Dense(16, activation="relu")(x)
x = Dropout(0.1)(x)
x = Dense(1, activation="sigmoid")(x)
model = Model(inputs=inp, outputs=x)
model.compile(loss='binary_crossentropy', optimizer='Adamax', metrics=['accuracy',f1])

print(model.summary())

In [None]:
model.fit(train_X, train_y, batch_size=512, epochs=10, validation_data=(val_X, val_y))

In [None]:
pred_noemb_val_y = model.predict([val_X], batch_size=1024, verbose=1)

In [None]:
from tqdm import tqdm
def threshold_search(y_true, y_proba):
    # reference: https://www.kaggle.com/hung96ad/pytorch-starter
    best_threshold = 0
    best_score = 0
    for threshold in tqdm([i * 0.001 for i in range(1000)]):
        score = metrics.f1_score(y_true=y_true, y_pred=y_proba > threshold)
        if score > best_score:
            best_threshold = threshold
            best_score = score
    search_result = {'threshold': best_threshold, 'f1': best_score}
    return search_result
search_result = threshold_search(val_y, pred_noemb_val_y)
search_result

##### We can see the f1 score was 0.76 at threshold value of 0.159 
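
To see why the best threshold can sit far from 0.5, here is a toy threshold search in plain Python. The labels and probabilities are invented for illustration, and `f1_at` is a hypothetical helper. Because the threshold is chosen on the same validation data it is scored on, the resulting F1 is likely to be somewhat optimistic.

```python
def f1_at(y_true, y_proba, threshold):
    """F1 after turning probabilities into 0/1 labels at the given threshold."""
    y_pred = [1 if p > threshold else 0 for p in y_proba]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    if tp == 0:
        return 0.0
    precision = tp / sum(y_pred)
    recall = tp / sum(y_true)
    return 2 * precision * recall / (precision + recall)

y_true = [0, 0, 1, 1, 1, 0]
y_proba = [0.10, 0.40, 0.35, 0.80, 0.20, 0.05]

# Try thresholds 0.00, 0.01, ..., 0.99 and keep the one with the highest F1.
best_threshold, best_f1 = max(
    ((t / 100, f1_at(y_true, y_proba, t / 100)) for t in range(100)),
    key=lambda pair: pair[1],
)
print(best_threshold, round(best_f1, 3))
```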

### Building model using Glove Embeddings

In [None]:
train = train_df.drop('target',axis = 1)
df = pd.concat([train ,test_df])

### Loading embeddings

In [None]:
def load_embed(file):
    def get_coefs(word,*arr): 
        return word, np.asarray(arr, dtype='float32')
    
    if file == '/kaggle/embeddings/wiki-news-300d-1M/wiki-news-300d-1M.vec':
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(file) if len(o)>100)
    else:
        embeddings_index = dict(get_coefs(*o.split(" ")) for o in open(file, encoding='latin'))
        
    return embeddings_index

In [None]:
glove = '/kaggle/input/embeddings/glove-840B-300d.txt'
print("Extracting GloVe embedding")
embed_glove = load_embed(glove)
print('Loaded glove embeddings successfully...')

### Vocabulary and Coverage functions

In [None]:
import operator 
def build_vocab(texts):
    sentences = texts.apply(lambda x: x.split()).values
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            try:
                vocab[word] += 1
            except KeyError:
                vocab[word] = 1
    return vocab

In [None]:
def check_coverage(vocab, embeddings_index):
    known_words = {}
    unknown_words = {}
    nb_known_words = 0
    nb_unknown_words = 0
    for word in vocab.keys():
        try:
            known_words[word] = embeddings_index[word]
            nb_known_words += vocab[word]
        except KeyError:
            # word has no pretrained vector
            unknown_words[word] = vocab[word]
            nb_unknown_words += vocab[word]

    # share of distinct words that have an embedding
    print('Found embeddings for {:.3%} of vocab'.format(len(known_words) / len(vocab)))
    # share of all word occurrences that have an embedding
    print('Found embeddings for  {:.3%} of all text'.format(nb_known_words / (nb_known_words + nb_unknown_words)))
    # most frequent out-of-vocabulary words first
    unknown_words = sorted(unknown_words.items(), key=operator.itemgetter(1))[::-1]

    return unknown_words
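
The two percentages printed by `check_coverage` measure different things, which a toy example makes concrete. The vocabulary counts and the tiny embedding index below are invented for illustration.

```python
vocab = {"the": 5, "fire": 3, "omg": 1, "lol": 1}   # word -> count in the corpus
embeddings = {"the": [0.1], "fire": [0.2]}          # hypothetical embedding index

known = {w: c for w, c in vocab.items() if w in embeddings}

# Share of distinct words that have a vector.
vocab_coverage = len(known) / len(vocab)
# Share of all word occurrences that have a vector.
text_coverage = sum(known.values()) / sum(vocab.values())

print(f"{vocab_coverage:.1%} of vocab, {text_coverage:.1%} of all text")
```

Half of the distinct words are missing, but since the missing ones are rare, 80% of the running text is still covered. Frequent unknown words are therefore the most valuable ones to fix.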

In [None]:
vocab = build_vocab(df['text'])

In [None]:
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)

In [None]:
def add_lower(embedding, vocab):
    count = 0
    for word in vocab:
        if word in embedding and word.lower() not in embedding:  
            embedding[word.lower()] = embedding[word]
            count += 1
    print(f"Added {count} words to embedding")

In [None]:
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)
add_lower(embed_glove, vocab)
oov_glove = check_coverage(vocab, embed_glove)

### Contractions:

In [None]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", 
"we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have" }

In [None]:
def known_contractions(embed):
    known = []
    for contract in contraction_mapping:
        if contract in embed:
            known.append(contract)
    return known

In [None]:
print("- Known Contractions -")
print("   Glove :")
print(known_contractions(embed_glove))

In [None]:
def clean_contractions(text, mapping):
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text
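
A quick usage sketch shows what this kind of replacement does and where it stops. The function is restated so the snippet runs on its own, and `toy_mapping` is a two-entry stand-in for the full `contraction_mapping`.

```python
def clean_contractions(text, mapping):
    # normalise curly and accent-style apostrophes to a plain one
    for s in ["’", "‘", "´", "`"]:
        text = text.replace(s, "'")
    # replace whole space-separated tokens that appear in the mapping
    return " ".join(mapping.get(t, t) for t in text.split(" "))

toy_mapping = {"can't": "cannot", "it's": "it is"}

expanded = clean_contractions("I can’t believe it’s real", toy_mapping)
missed = clean_contractions("No, I can't!", toy_mapping)
print(expanded)
print(missed)
```

Because whole tokens are matched, a contraction with punctuation attached (`can't!`) or with different capitalisation is left unchanged.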

In [None]:
df['treated_text'] = df['text'].apply(lambda x: clean_contractions(x, contraction_mapping))

In [None]:
vocab = build_vocab(df['treated_text'])
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)

### Punctuations:

In [None]:
punct = "/-'?!.,#$%\'()*+-/:;<=>@[\\]^_`{|}~" + '""“”’' + '∞θ÷α•à−β∅³π‘₹´°£€\\×™√²—–&'

In [None]:
def unknown_punct(embed, punct):
    unknown = ''
    for p in punct:
        if p not in embed:
            unknown += p
            unknown += ' '
    return unknown

In [None]:
print("Glove :")
print(unknown_punct(embed_glove, punct))

In [None]:
punct_mapping = {"‘": "'", "₹": "e", "´": "'", "°": "", "€": "e", "™": "tm", "√": " sqrt ", "×": "x", "²": "2", "—": "-", "–": "-", "’": "'", "_": "-", "`": "'", '“': '"', '”': '"', '“': '"', "£": "e", '∞': 'infinity', 'θ': 'theta', '÷': '/', 'α': 'alpha', '•': '.', 'à': 'a', '−': '-', 'β': 'beta', '∅': '', '³': '3', 'π': 'pi', }

In [None]:
def clean_special_chars(text, punct, mapping):
    for p in mapping:
        text = text.replace(p, mapping[p])
    
    for p in punct:
        text = text.replace(p, f' {p} ')
    
    specials = {'\u200b': ' ', '…': ' ... ', '\ufeff': ''}  # other special characters, handled last
    for s in specials:
        text = text.replace(s, specials[s])
    
    return text
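
The effect of padding punctuation with spaces is easiest to see on a short string. The sketch below restates the first two loops under the hypothetical name `space_out_punct`, with a tiny punctuation set and mapping chosen for illustration.

```python
def space_out_punct(text, punct, mapping):
    # first normalise unusual characters
    for src, dst in mapping.items():
        text = text.replace(src, dst)
    # then surround each punctuation mark with spaces
    for p in punct:
        text = text.replace(p, f" {p} ")
    return text

cleaned = space_out_punct("Fire!#help it’s bad", punct="!#'", mapping={"’": "'"})
tokens = cleaned.split()
print(tokens)
```

Each punctuation mark becomes its own token, so `Fire` and `help` can be looked up in the embedding index separately. The apostrophe is split as well, which is one reason to expand contractions before this step.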

In [None]:
df['treated_text'] = df['treated_text'].apply(lambda x: clean_special_chars(x, punct, punct_mapping))

In [None]:
vocab = build_vocab(df['treated_text'])
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)

### Correcting spellings:

In [None]:
mispell_dict = {'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization', 'pokémon': 'pokemon'}

In [None]:
def correct_spelling(x, dic):
    for word in dic.keys():
        x = x.replace(word, dic[word])
    return x

In [None]:
df['treated_text'] = df['treated_text'].apply(lambda x: correct_spelling(x, mispell_dict))

In [None]:
vocab = build_vocab(df['treated_text'])
print("Glove : ")
oov_glove = check_coverage(vocab, embed_glove)

### Correcting the spellings did not help here: the percentage of words with embeddings decreased.
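
One possible contributor, offered as an assumption rather than a diagnosis, is that `str.replace` works on substrings and can therefore rewrite the inside of correct words. The sketch below contrasts it with a whole-word variant built on `re`. The helper names and the two-entry dictionary are for illustration only.

```python
import re

def replace_substrings(text, dic):
    """Same approach as correct_spelling: plain substring replacement."""
    for wrong, right in dic.items():
        text = text.replace(wrong, right)
    return text

def replace_whole_words(text, dic):
    """Only replace when the misspelling stands alone as a word."""
    for wrong, right in dic.items():
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text)
    return text

toy_dict = {"howdo": "how do", "colour": "color"}
text = "epic showdown of colour"

naive = replace_substrings(text, toy_dict)
bounded = replace_whole_words(text, toy_dict)
print(naive)
print(bounded)
```

The substring version splits `showdown` because it contains `howdo`, while the whole-word version leaves it intact and still fixes `colour`.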

### Preparing data

In [None]:
train['treated_text'] = train['text'].apply(lambda x: x.lower())
# Contractions (applied to the already treated text so that each step builds on the previous one)
train['treated_text'] = train['treated_text'].apply(lambda x: clean_contractions(x, contraction_mapping))
# Special characters
train['treated_text'] = train['treated_text'].apply(lambda x: clean_special_chars(x, punct, punct_mapping))

In [None]:
def make_data(X):
    t = Tokenizer(num_words=max_features)
    t.fit_on_texts(X)
    X = t.texts_to_sequences(X)
    X = pad_sequences(X, maxlen=maxlen)
    return X, t.word_index

In [None]:
X, word_index = make_data(train['text'])

In [None]:
def make_treated_data(X):
    t = Tokenizer(num_words=max_features, filters='')
    t.fit_on_texts(X)
    X = t.texts_to_sequences(X)
    X = pad_sequences(X, maxlen=maxlen)
    return X, t.word_index

In [None]:
X_treated, word_index_treated = make_treated_data(train['treated_text'])

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X, train_df['target'].values, test_size=0.1, random_state=2020)

In [None]:
X_t_train, X_t_val, _, _ = train_test_split(X_treated, train_df['target'].values, test_size=0.1, random_state=2020)

In [None]:
def make_embed_matrix(embeddings_index, word_index, len_voc):
    # stack all pretrained vectors to estimate their mean and standard deviation
    all_embs = np.stack(list(embeddings_index.values()))
    emb_mean,emb_std = all_embs.mean(), all_embs.std()
    embed_size = all_embs.shape[1]
    # random initialisation, so words without a pretrained vector still get a plausible row
    embedding_matrix = np.random.normal(emb_mean, emb_std, (len_voc, embed_size))
    
    for word, i in word_index.items():
        if i >= len_voc:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None: 
            embedding_matrix[i] = embedding_vector
    
    return embedding_matrix

In [None]:
embedding = make_embed_matrix(embed_glove, word_index, max_features)
del word_index
gc.collect()

In [None]:
embedding_treated = make_embed_matrix(embed_glove, word_index_treated, max_features)
del word_index_treated
gc.collect()

In [None]:
def modelling(embe_matrix):
    inp = Input(shape=(maxlen,))
    x = Embedding(max_features, 300, weights=[embe_matrix])(inp)
    x = Bidirectional(GRU(64, return_sequences=True))(x)
    x = GlobalMaxPool1D()(x)
    x = Dense(16, activation='relu')(x)
    x = Dropout(0.1)(x)
    x = Dense(1, activation="sigmoid")(x)
    model = Model(inputs=inp, outputs=x)
    model.compile(loss='binary_crossentropy', optimizer='Adam', metrics=['accuracy',f1])
    return model

In [None]:
model = modelling(embedding)

In [None]:
model_treated = modelling(embedding_treated)

In [None]:
history = model.fit(X_train, y_train, batch_size=512, epochs=10, 
                    validation_data=(X_val, y_val))  # validate on the split made with the same tokenizer
pred_val = model.predict(X_val, batch_size=512, verbose=1)

In [None]:
history = model_treated.fit(X_t_train, y_train, batch_size=512, epochs=10, 
                            validation_data=(X_t_val, y_val))
pred_t_val = model_treated.predict(X_t_val, batch_size=512, verbose=1)

In [None]:
search_result = threshold_search(y_val, pred_val)
search_result

##### We can observe that f1-score without text preprocessing was 0.799

In [None]:
search_result = threshold_search(y_val, pred_t_val)
search_result

##### We can observe that f1-score with text preprocessing was 0.82

## Therefore, we can conclude that preprocessing the text to match the embeddings we use helped to increase the score in this run (from 0.799 to 0.82).

# That's all for now. <font color='red'>Please consider upvoting this kernel</font>. Suggestions to improve this kernel further are much appreciated. <br>
# <font color='blue'>Happy learning :)</font>

<a href="#top" class="btn btn-primary btn-lg active" role="button" aria-pressed="true">Go to TOP</a>