# **Introduction**
<div class="column">
<img src="https://media-exp1.licdn.com/dms/image/C561BAQGEbzpXZ34-gQ/company-background_10000/0?e=2159024400&v=beta&t=o3vOn3Ye-qpqlDH64A1of1_aRAQ8TunahPQ4ZWuISRI" style="width:650px;height:350px;">
    </div><br>
<b>In this kernel we will explore the Disaster Tweets data together: we will do some data analysis and data cleaning, and then build a simple NLP model to predict whether a tweet is about a real disaster or not. I have tried to explain every step so that, even if this is your first NLP problem, nothing along the way should be confusing.</b><br><br>

##  **<font color="red"> Please do an upvote if you find my kernel useful.</font>**

# **Table of Contents**
* [Importing necessities](#1)
* [Reading the data](#2)
* [EDA](#3)
* [Data Cleaning](#4)
* [Model](#5)


<a id = '1'></a>
# **Importing necessities**

In [None]:
import numpy as np
import pandas as pd 
import os
import re
import string

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
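sns.set()  # seaborn's default theme; the 'seaborn' matplotlib style name is no longer available in newer matplotlib releases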
from plotly import graph_objs as go
import plotly.express as px
from collections import Counter
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS
from PIL import Image

import keras
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten, Dropout
from keras.optimizers import Adam

<a id = '2'></a>
# **Reading the data**

In [None]:
df1 = pd.read_csv('../input/nlp-getting-started/train.csv')
df2 = pd.read_csv('../input/nlp-getting-started/test.csv')
submit = pd.read_csv('../input/nlp-getting-started/sample_submission.csv')

In [None]:
print(df1.shape)
print(df2.shape)

**So we have 7613 tweets in the train set and 3263 tweets in the test set**

In [None]:
df1.info()

In [None]:
df1.head()

<a id = '3'></a>
# **EDA**

**Let's see how many tweets are disaster and non-disaster tweets**

In [None]:
temp = df1.groupby('target').count()['text'].reset_index()
temp['label'] = temp['target'].apply(lambda x : 'Disaster Tweet' if x==1 else 'Non Disaster Tweet')
temp

In [None]:
plt.figure(figsize=(7,5))
sns.countplot(x='target',data=df1)

In [None]:
fig = go.Figure(go.Funnelarea(
    text = temp.label,
    values = temp.text,
    title = {"position" : "top center", "text" : "Funnel Chart for target distribution"}
    ))
fig.show()

**Target Distribution in Keywords**

In [None]:
df1['target_mean'] = df1.groupby('keyword')['target'].transform('mean')

fig = plt.figure(figsize=(8, 78), dpi=100)

sns.countplot(y=df1.sort_values(by='target_mean', ascending=False)['keyword'],
              hue=df1.sort_values(by='target_mean', ascending=False)['target'])

plt.tick_params(axis='x', labelsize=15)
plt.tick_params(axis='y', labelsize=12)
plt.legend(loc=1)
plt.title('Target Distribution in Keywords')

plt.show()

df1.drop(columns=['target_mean'], inplace=True)

**Number of words in a tweet**

In [None]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,5))
tweet_len=df1[df1['target']==1]['text'].str.split().map(lambda x: len(x))
ax1.hist(tweet_len,color='red')
ax1.set_title('Disaster Tweets')
tweet_len=df1[df1['target']==0]['text'].str.split().map(lambda x: len(x))
ax2.hist(tweet_len,color='blue')
ax2.set_title('Non Disaster Tweets')
fig.suptitle('No.of words in a tweet')
plt.show()

**Now let us look at the most common words in the tweets. First we convert all the text to lowercase so that the same word written in different cases is not counted separately.**

In [None]:
def clean_text(text):
    text = str(text).lower()
    return text

df1['text_plot'] = df1['text'].apply(lambda x:clean_text(x))

df1['temp_list'] = df1['text_plot'].apply(lambda x:str(x).split())
top = Counter([item for sublist in df1['temp_list'] for item in sublist])
temp = pd.DataFrame(top.most_common(25))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')
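
To see why the lowercasing step matters for the counts, here is a minimal sketch on three made-up tweets (the `tweets` list is an assumption for illustration only, it is not taken from the dataset). It counts the same word with and without lowercasing.

```python
from collections import Counter

# Toy tweets, invented for this illustration only.
tweets = ["Fire near the forest", "FIRE crews respond", "fire spreads fast"]

# Without lowercasing, "Fire", "FIRE" and "fire" are three separate tokens.
raw_counts = Counter(word for tweet in tweets for word in tweet.split())

# After lowercasing, they collapse into a single token.
lower_counts = Counter(word for tweet in tweets for word in tweet.lower().split())

print(raw_counts["fire"], lower_counts["fire"])
```

The second count is the one we want for a "most common words" table, because it treats all three spellings as one word.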

In [None]:
fig = px.bar(temp, x='count',y='Common_words',title='Common words in tweet',orientation='h',width=700,height=700,color='Common_words')
fig.show()

**Now we will remove all the stopwords and then will observe the common words graphically.**

In [None]:
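# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

def remove_stopwords(x):
    return [y for y in x if y not in stop_words]
df1['temp_list'] = df1['temp_list'].apply(lambda x : remove_stopwords(x))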

In [None]:
top = Counter([item for sublist in df1['temp_list'] for item in sublist])
temp = pd.DataFrame(top.most_common(25))
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Purples')

In [None]:
fig = px.treemap(temp, path=['Common_words'], values='count',title='Tree of Most Common Words')
fig.show()

**Let us also visualize the wordcloud**

In [None]:
text = df1['text'].values
twitter_logo = np.array(Image.open('../input/twitter-logo2/10wmt-articleLarge-v4.jpg'))
cloud = WordCloud(
                          stopwords=STOPWORDS,
                          background_color='white',
                          mask = twitter_logo,
                          max_words=200
                         ).generate(" ".join(text))

plt.imshow(cloud)
plt.axis('off')
plt.show()

<a id = '4'></a>
# **Data Cleaning**

We will apply the data cleaning steps to both the training and the test data. To avoid applying each step twice, we concatenate the two sets here and separate them again once the cleaning is done.

In [None]:
del df1['text_plot']
del df1['temp_list']

df = pd.concat([df1,df2])

In [None]:
df.head()

**First we will fill all the null values with no_{column name}.**

In [None]:
for col in ['keyword', 'location']:
    df[col] = df[col].fillna(f'no_{col}')

In [None]:
df.head()

**As we saw in the EDA, we have to remove many things such as URLs, HTML tags, punctuation marks etc.**

**So now we will remove all the URLs and the HTML tags.**

In [None]:
df['text']=df['text'].str.replace(r'https?://\S+|www\.\S+','',regex=True).str.replace(r'<.*?>','',regex=True)
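
The HTML tag pattern uses a non-greedy `.*?`, and that small `?` matters. The sketch below uses plain `re` on a made-up string (the `sample` text is my own assumption, not a tweet from the data) to compare it with the greedy version.

```python
import re

# Made-up example text, for illustration only.
sample = "Flood <b>warning</b> issued"

# Non-greedy: each <...> match stops at the first closing bracket.
non_greedy = re.sub(r'<.*?>', '', sample)

# Greedy: one match runs from the first '<' to the last '>'.
greedy = re.sub(r'<.*>', '', sample)

print(repr(non_greedy))
print(repr(greedy))
```

The greedy pattern would also delete the word between the tags, which is why the non-greedy form is the safer choice here.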

**Removing all the emojis**

In [None]:
def remove_emoji(text):
    emoji_pattern = re.compile("["
                           u"\U0001F600-\U0001F64F"  # emoticons
                           u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                           u"\U0001F680-\U0001F6FF"  # transport & map symbols
                           u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           u"\U00002702-\U000027B0"
                           u"\U000024C2-\U0001F251"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

In [None]:
df['text'] = df['text'].apply(lambda x : remove_emoji(x))

**Removing punctuation marks**

In [None]:
def remove_punct(text):
    table=str.maketrans('','',string.punctuation)
    return text.translate(table)

In [None]:
df['text'] = df['text'].apply(lambda x : remove_punct(x))

**Remove leading, trailing, and extra spaces**

In [None]:
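def clean_text(text):
    # Note: this redefines the clean_text helper used in the EDA section.
    # Collapse runs of whitespace into one space, then trim both ends.
    text = re.sub(r'\s+', ' ', text).strip()
    return text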

In [None]:
df['text'] = df['text'].apply(lambda x : clean_text(x))
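
The order of these cleaning steps is not arbitrary: URLs have to be removed before punctuation, otherwise the URL pattern no longer matches. Here is a small sketch with plain Python strings; the helper names `strip_urls`, `strip_punct` and `squeeze_spaces` and the sample text are my own assumptions, used only to illustrate the point.

```python
import re
import string

PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_urls(text):
    # Same URL pattern as in the cleaning cell above.
    return re.sub(r'https?://\S+|www\.\S+', '', text)

def strip_punct(text):
    return text.translate(PUNCT_TABLE)

def squeeze_spaces(text):
    return re.sub(r'\s+', ' ', text).strip()

# Made-up example text, for illustration only.
sample = "  Forest   fire!!! http://t.co/abc  near La Ronge, Sask.  "

# URLs first, then punctuation, then whitespace (the order used in this kernel).
right_order = squeeze_spaces(strip_punct(strip_urls(sample)))

# Punctuation first destroys "://" and ".", so the URL survives as a junk token.
wrong_order = squeeze_spaces(strip_urls(strip_punct(sample)))

print(right_order)
print(wrong_order)
```

In the wrong order the link is left behind as a meaningless token, which would later be hashed like any other word.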

**Now that all the cleaning steps have been applied, it's time to separate our data back into train and test.**

In [None]:
# The first len(df1) rows are the training data, the rest are the test data
dfs = [df.iloc[:len(df1)].copy(), df.iloc[len(df1):].copy()]

In [None]:
train = dfs[0]
train.shape

In [None]:
test = dfs[1]
test.shape

In [None]:
test.drop('target',axis=1,inplace=True)

<a id = '5'></a>
# **Model**
After doing all the data analysis and applying the data cleaning process, it's now time to create our model.

Natural Language Processing (NLP) is a field of machine learning concerned with the ability of a computer to understand, analyze, manipulate, and potentially generate human language. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. We will be using Keras to create our own NLP model.

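**First we define the vocabulary size as len(test), i.e. the number of rows in the test set (3263). This is the number of integer codes available for words, so the system can tell apart at most about len(test) different words. It is not the number of distinct words that actually occur in the tweets.**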

In [None]:
vocab_size = len(test)
text = train['text'].values
label = train['target'].values

**Each word is just a sequence of characters, and the model cannot work with raw characters directly. Therefore we convert each word into an integer between 1 and the vocabulary size. Despite its name, Keras's one_hot does not produce a one-hot encoding: it hashes each word to an integer, so a list of words becomes a list of integers. Because it is a hash, two different words can occasionally end up with the same integer, and this becomes more likely the smaller the vocabulary size is.**

In [None]:
encoded_docs = [one_hot(d,vocab_size) for d in text]
for x in range(5):
    print(encoded_docs[x])
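
To make the hashing idea concrete, here is a minimal pure-Python sketch. It is not Keras's actual implementation: I use MD5 as a stand-in hash so the result is repeatable, and the function names `stable_hash_index` and `encode` are hypothetical. The index formula `hash % (n - 1) + 1` is, as far as I understand, the same shape Keras uses, so treat that as an assumption.

```python
import hashlib

def stable_hash_index(word, n):
    """Map a word to an integer in [1, n - 1] using MD5 as a stand-in hash."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

def encode(text, n):
    # Lowercase, split on whitespace, hash every word.
    return [stable_hash_index(w, n) for w in text.lower().split()]

codes = encode("forest fire near forest", 3263)
print(codes)

# With a tiny vocabulary, collisions are unavoidable:
# 10 different words must share at most 4 possible codes.
tiny_codes = encode("a b c d e f g h i j", 5)
print(len(set(tiny_codes)))
```

The same word always gets the same code (the two occurrences of "forest" match), but with a small `n` different words are forced to share codes.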

**Next we pad the sequences: if a sentence is shorter than max_len, we fill the remaining positions with zeros.**

In [None]:
max_len = max(len(doc) for doc in encoded_docs)  # length of the longest tweet, in tokens
pad_docs = pad_sequences(encoded_docs,maxlen=max_len,padding='post')
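
What `padding='post'` does can be sketched in a few lines of plain Python. The helper `pad_post` below is hypothetical and only mimics the behaviour I expect from `pad_sequences` with its default truncation (dropping tokens from the start of sequences that are too long); it assumes `maxlen` is at least 1.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence with `value` at the end up to `maxlen` items."""
    padded = []
    for seq in sequences:
        # Too long: keep only the last `maxlen` items.
        seq = list(seq)[-maxlen:]
        # Too short: append zeros until the length is `maxlen`.
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

example = pad_post([[5, 2], [7, 1, 9, 4]], maxlen=3)
print(example)
```

Every row now has the same length, which is what the Embedding layer needs as input.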

In [None]:
train.shape

**Now we will define our model.**

Here we create a Sequential model. It starts with an Embedding layer with input_dim=7613 (hard-coded; it only needs to be at least vocab_size so that every word index fits), output dimension 100 and input_length=max_len, followed by one Dropout layer and one Flatten layer. Then we add 2 Dense layers with 1024 units each and relu activation, with a Dropout layer after each Dense layer. In the end we add one final Dense layer with a single output unit and sigmoid activation.

Then we will compile our model with Adam optimizer and binary_crossentropy as loss function.

In [None]:
model = Sequential()
model.add(Embedding(7613,100,input_length=max_len))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1024, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1,activation='sigmoid'))
model.compile(optimizer='Adam',loss='binary_crossentropy',metrics=['acc'])

In [None]:
model.summary()
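
The parameter counts printed by the summary can be checked by hand. Below is a rough sketch of the arithmetic for this architecture; `count_params` is a hypothetical helper, and the value 30 for `max_len` is only an assumed example, the real value depends on the data.

```python
def count_params(max_len, input_dim=7613, embed_dim=100, units=1024):
    """Trainable parameters per layer for the model defined above."""
    flat = max_len * embed_dim            # size of the Flatten output
    return {
        "embedding": input_dim * embed_dim,   # one vector per word index
        "dense_1": flat * units + units,      # weights + biases
        "dense_2": units * units + units,
        "output": units * 1 + 1,
    }

# Assumed example value, not the actual max_len of this dataset.
params = count_params(max_len=30)
total = sum(params.values())
print(params, total)
```

Dropout and Flatten have no trainable parameters. Notice that the first Dense layer grows linearly with `max_len`, so it quickly dominates the size of the model.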

**Now we will train our model with our padded data pad_docs and label with 20 epochs and batch_size=32.**

In [None]:
model.fit(pad_docs,label,epochs=20,batch_size=32)

**Let's check our predictions on the training data**

In [None]:
prediction = model.predict(pad_docs)

In [None]:
train['prediction'] = prediction.ravel()  # model.predict returns shape (n, 1); flatten it to 1-D

In [None]:
train['prediction'] = train['prediction'].apply(lambda x : 0 if x<0.5 else 1)

In [None]:
train.head()
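
Keep in mind that these predictions are made on the same rows the model was trained on, so they show how well the model fits the training data, not how well it generalizes. To put a number on the quality of thresholded predictions, one option is the F1 score, which, as far as I know, is the metric this competition uses. Here is a minimal pure-Python sketch; the helper names and the toy labels and probabilities are made up for illustration.

```python
def threshold(probs, cutoff=0.5):
    """Turn probabilities into 0/1 labels, same rule as in the cell above."""
    return [0 if p < cutoff else 1 for p in probs]

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy values, invented for this example.
y_true = [1, 0, 1, 1, 0]
probs = [0.9, 0.4, 0.3, 0.8, 0.6]

y_pred = threshold(probs)
score = f1_score(y_true, y_pred)
print(y_pred, score)
```

For an honest estimate, the same calculation should be done on rows that were held out from training.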

**Now we will apply our trained model to the test data. But before that we also have to convert the text of the test data to padded sequences, as we did for the training data.**

In [None]:
text2 = test['text'].values
encoded_docs2 = [one_hot(d,vocab_size) for d in text2]
pad_docs2 = pad_sequences(encoded_docs2,maxlen=max_len,padding='post')

In [None]:
prediction2 = model.predict(pad_docs2)

In [None]:
test['prediction'] = prediction2.ravel()  # flatten the (n, 1) output to 1-D
test['prediction'] = test['prediction'].apply(lambda x : 0 if x<0.5 else 1)

In [None]:
test.head()

**Submitting our predictions**

In [None]:
submit['target'] = test['prediction'].values  # assign by position, so the result does not depend on index alignment

In [None]:
submit.head()

In [None]:
submit.to_csv('submission.csv',index=False)

**This is my first Kaggle notebook, and I have tried to explain each and every step. I will be back with new ideas and models as I learn more about different machine learning models. If you have any suggestions, or any doubts about any step, please comment below.**

#  **<font color="red"> Please do an upvote if you liked my kernel!</font>**