# Natural Language Processing with Disaster Tweets
**Predict which Tweets are about real disasters and which ones are not**

Twitter has become an important communication channel in times of emergency.
The ubiquitousness of smartphones enables people to announce an emergency they're observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

In this competition, you're challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren't. You'll have access to a dataset of 10,000 tweets that were hand-classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

**Disclaimer:** The dataset for this competition contains text that may be considered profane, vulgar, or offensive.


![Disaster](https://images.unsplash.com/photo-1536245344390-dbf1df63c30a?ixid=MXwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHw%3D&ixlib=rb-1.2.1&auto=format&fit=crop&w=752&q=80)

## Introduction 
This notebook is aimed at beginners and serves as an entry point to natural language processing. It uses the dataset from the [Kaggle competition](https://www.kaggle.com/c/nlp-getting-started) and simple tools for cleaning text and training a model.

**I will show you how to:**
- Analyze the dataset
- Visualize keywords
- Clean the data
- Build word clouds
- Tokenize text
- Vectorize text
- Train a simple model
- Evaluate the model (F1)
- Predict on the test dataset

## Importing Required Libraries

In [None]:
## All purpose library
import pandas as pd
import numpy as np

## NLP library
import re
import string
import nltk
from nltk.corpus import stopwords

## ML Library
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.model_selection import RepeatedStratifiedKFold,cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import f1_score

## Visualization library
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

## Ignore warnings during training
import warnings
warnings.filterwarnings('ignore')

## Analyzing dataset

In [None]:
## using the pandas read_csv function to load the csv files
train=pd.read_csv("../input/nlp-getting-started/train.csv")
test=pd.read_csv("../input/nlp-getting-started/test.csv")

## Displaying the first rows of the training and testing dataframes
print("Training Data")
display(train.head(3))
print("Testing Data")
display(test.head(3))

In [None]:
## Shape of Datasets
print("Train Dataset shape:\n",train.shape,"\n") ## (7613 rows, 5 Columns)
print("Test Dataset shape:\n",test.shape) ## (3263 rows, 4 Columns)

### Checking Missing Values

In [None]:
## isnull gives boolean data; summing the True values gives the number of missing values per column.
print("Train Dataset missing data:\n",train.isnull().sum(),"\n")
print("Test Dataset missing data:\n",test.isnull().sum())

## Visualization

In [None]:
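## value_counts on target gives the number of 0s (non-disaster tweets)
## and 1s (disaster tweets).
## Naming the columns explicitly keeps this independent of the pandas version
## (value_counts names its output differently in pandas 1.x and 2.x).
VCtrain=train['target'].value_counts().rename_axis("target").reset_index(name="count")

## seaborn barplot to display a bar chart
sns.barplot(data=VCtrain,x="target",y="count",palette="viridis")
VCtrain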

### Going deep into disaster tweets

In [None]:
## Going deep into disaster Tweets
display("Random sample of disaster tweets:",train[train.target==1].text.sample(3).to_frame())
display("Random sample of non disaster tweets:",train[train.target==0].text.sample(3).to_frame())

### Most common keywords

In [None]:
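## top 20 keywords, with explicit column names so the plot works on any pandas version
common_keywords=train["keyword"].value_counts()[:20].rename_axis("keyword").reset_index(name="count")
fig=plt.figure(figsize=(15,6))
sns.barplot(data=common_keywords,x="keyword",y="count",palette="viridis")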
plt.title("Most common keywords",size=16)
plt.xticks(rotation=70,size=12);

## Using pie chart 

In [None]:

## share of disaster vs. normal tweets among tweets containing the word "disaster"
train[train.text.str.contains("disaster")].target.\
 value_counts().rename(index={1:"Disaster",0:"normal"}).\
  plot.pie(figsize=(12,6),title="Tweets with Disaster mentioned");

### Location of Tweets

In [None]:
train.location.value_counts()[:10].to_frame()

## Text Cleaning

In [None]:
def clean_text(text):
    """Lowercase a tweet and strip brackets, HTML tags, links, punctuation and words with digits."""
    text = text.lower()                                               # lowercase the text
    text = re.sub(r'\[.*?\]', '', text)                               # remove text in square brackets
    text = re.sub(r'<.*?>+', '', text)                                # remove HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '', text)                 # remove hyperlinks
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)   # remove punctuation
    text = re.sub(r'\n', '', text)                                    # remove newlines
    text = re.sub(r'\w*\d\w*', '', text)                              # remove words containing numbers
    return text

# apply the same cleaning to both datasets (raw strings avoid invalid-escape warnings)
train.text=train.text.apply(clean_text)
test.text=test.text.apply(clean_text)

train.text.head()
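
To see what each cleaning step does, the sketch below applies the same regular expressions, in the same order, to a single made-up tweet and records the intermediate results. The sample text and the names `STEPS` and `trace_cleaning` are illustrative assumptions, not part of the dataset or the notebook.

```python
import re
import string

# (description, pattern) pairs mirroring the cleaning steps used above
STEPS = [
    ("square brackets", r"\[.*?\]"),
    ("html tags", r"<.*?>+"),
    ("hyperlinks", r"https?://\S+|www\.\S+"),
    ("punctuation", "[%s]" % re.escape(string.punctuation)),
    ("newlines", r"\n"),
    ("words with digits", r"\w*\d\w*"),
]

def trace_cleaning(text):
    """Return a list of (step name, text after that step)."""
    text = text.lower()
    history = [("lowercase", text)]
    for name, pattern in STEPS:
        text = re.sub(pattern, "", text)
        history.append((name, text))
    return history

# hypothetical tweet, made up for illustration
sample = "Forest FIRE near La Ronge [update] <b>now</b> http://t.co/abc123 #wildfire 2day!!"
history = trace_cleaning(sample)
for name, text in history:
    print(f"{name:>18}: {text!r}")
```

The removed pieces leave extra spaces behind. That is harmless here, because the tokenizer used later only picks up runs of word characters.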

## Word cloud of tweets

In [None]:
disaster_tweets = train[train['target']==1]['text']
non_disaster_tweets = train[train['target']==0]['text']

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=[16, 8])
wordcloud1 = WordCloud( background_color='white',
                        width=600,
                        height=400).generate(" ".join(disaster_tweets))
ax1.imshow(wordcloud1)
ax1.axis('off')
ax1.set_title('Disaster Tweets',fontsize=40);

wordcloud2 = WordCloud( background_color='white',
                        width=600,
                        height=400).generate(" ".join(non_disaster_tweets))
ax2.imshow(wordcloud2)
ax2.axis('off')
ax2.set_title('Non Disaster Tweets',fontsize=40);

## Tokenization
Tokenization is the process of splitting a string of text into a list of tokens. A token is a unit of a larger whole: a word is a token in a sentence, and a sentence is a token in a paragraph.

In [None]:
#Tokenizer
token=nltk.tokenize.RegexpTokenizer(r'\w+')
#applying token
train.text=train.text.apply(lambda x:token.tokenize(x))
test.text=test.text.apply(lambda x:token.tokenize(x))
#view
display(train.text.head())
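
As a small illustration of what the tokenizer does: with the pattern `\w+`, `RegexpTokenizer` returns every run of letters, digits and underscores, which is the same result `re.findall` gives for that pattern. The sketch below uses only the standard library so it runs without NLTK; the sample string and the helper name `tokenize` are made up for illustration.

```python
import re

def tokenize(text):
    """Split text into word tokens: every maximal run of word characters."""
    return re.findall(r"\w+", text)

# hypothetical cleaned tweet, including leftover double spaces
sample = "forest fire near la ronge  now  wildfire "
tokens = tokenize(sample)
print(tokens)
```

Whitespace never matches `\w+`, so repeated or trailing spaces produce no empty tokens.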

In [None]:
nltk.download('stopwords')
#removing stop words (build the set once; calling stopwords.words() for every token is very slow)
stop_words=set(stopwords.words('english'))
train.text=train.text.apply(lambda x:[w for w in x if w not in stop_words])
test.text=test.text.apply(lambda x:[w for w in x if w not in stop_words])
#view
train.text.head()

In [None]:
test.text.head()
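
Stop words are very frequent words such as "the" or "is" that carry little signal for classification. The sketch below shows the filtering step on one token list, with a tiny hand-written stop-word set standing in for NLTK's English list. That set is an assumption made so the example runs without any downloads.

```python
# tiny stand-in for stopwords.words('english'); the real list is much longer
TOY_STOP_WORDS = {"the", "is", "a", "in", "of", "and", "to"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [w for w in tokens if w not in stop_words]

# hypothetical tokenized tweet
tokens = ["the", "river", "is", "flooding", "a", "town", "in", "ohio"]
filtered = remove_stop_words(tokens)
print(filtered)
```

Keeping the stop words in a `set` makes each membership check a constant-time lookup, which matters when the filter runs over thousands of tweets.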

## Stemming
Stemming and lemmatization are text normalization techniques that are widely used in NLP preprocessing, and both are available in NLTK. Stemming is faster because it strips word endings with fixed rules and ignores context. Lemmatization is slower because it relies on a vocabulary, and optionally the part of speech, to map each word to its dictionary form.

**In this case PorterStemmer performed better than lemmatization.**

In [None]:
#stemming each token and joining the tokens back into a single string
stemmer = nltk.stem.PorterStemmer()
train.text=train.text.apply(lambda x:" ".join(stemmer.stem(w) for w in x))
test.text=test.text.apply(lambda x:" ".join(stemmer.stem(w) for w in x))
#View
train.text.head()
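
The contrast between the two approaches can be sketched with toy versions of each. The code below is not the Porter algorithm and not NLTK's lemmatizer: `toy_stem` blindly strips a few suffixes, and `toy_lemmatize` looks words up in a small hand-written dictionary. Both functions and their word lists are hypothetical and only illustrate rule-based cutting versus dictionary lookup.

```python
# suffixes are tried in this order; a deliberately crude rule set
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# hand-written lookup table standing in for a real vocabulary
TOY_LEMMAS = {"fires": "fire", "flooding": "flood", "burned": "burn", "worse": "bad"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return TOY_LEMMAS.get(word, word)

words = ["flooding", "burned", "fires", "worse"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The rule-based version needs no vocabulary but can produce non-words (here "fires" is cut down to "fir") and misses irregular forms such as "worse". The lookup returns real words but only for entries it knows about.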

## Text Vectorization
Machine learning algorithms most often take numeric feature vectors as input. Thus, when working with text documents, we need a way to convert each document into a numeric vector.

**In this case CountVectorizer performed best.**

In [None]:
count_vectorizer = CountVectorizer()
train_vectors_count = count_vectorizer.fit_transform(train['text'])
test_vectors_count = count_vectorizer.transform(test["text"])
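
To make count vectors concrete, here is a minimal standard-library sketch of the bag-of-words idea behind `CountVectorizer`: build a sorted vocabulary, then count how often each vocabulary word occurs in each document. The three documents are made up, and the sketch simply splits on whitespace, whereas `CountVectorizer` by default lowercases text and keeps only tokens of two or more word characters.

```python
from collections import Counter

# hypothetical, already cleaned documents
docs = ["forest fire near town", "fire fire everywhere", "sunny day in town"]

# vocabulary: every distinct word, in sorted order (one column per word)
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocab):
    """Return the number of occurrences of each vocabulary word in doc."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [count_vector(doc, vocab) for doc in docs]
print(vocab)
for row in vectors:
    print(row)
```

Every document becomes a row of the same length, one entry per vocabulary word, which is the numeric input a classifier needs. As in the cell above, the vocabulary should be built from the training text only and then reused for the test text.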



## Using Logistic Regression for Training Model

In [None]:
# Fitting a simple Logistic Regression on Counts
CLR = LogisticRegression(C=2)
scores = cross_val_score(CLR, train_vectors_count, train["target"], cv=6, scoring="f1")
scores
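
The `scoring="f1"` setting reports the F1 score, the harmonic mean of precision and recall for the positive class (here, disaster tweets). A minimal sketch of the computation on made-up labels follows; `f1_binary` is an illustrative helper, not scikit-learn's implementation.

```python
def f1_binary(y_true, y_pred):
    """F1 score for the positive class (label 1) of a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions: define F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# hypothetical labels: 1 = disaster, 0 = not a disaster
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_binary(y_true, y_pred)
print(round(score, 3))
```

Because F1 ignores true negatives, a model cannot score well just by predicting the majority class, which makes it a reasonable choice when the positive class is the one that matters.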

## Using Simple Naive Bayes
Our simple logistic regression scored poorly on F1, so I decided to try another model. You could also try gradient boosting or another simple linear model on this data.

In [None]:
# Fitting a simple Naive Bayes
NB_Vec = MultinomialNB()
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=1)
scores = cross_val_score(NB_Vec, train_vectors_count, train["target"], cv=cv, scoring="f1")
scores

This is the best score I could reach after experimenting with various text vectorizers, cleaning steps, and simple models.

## Fitting the model and predicting on the test data

In [None]:
NB_Vec.fit(train_vectors_count, train["target"])

In [None]:
pred=NB_Vec.predict(test_vectors_count)

## Final Submission into Competition 

In [None]:
sample_submission = pd.read_csv("../input/nlp-getting-started/sample_submission.csv")
sample_submission["target"] = pred
sample_submission.to_csv("submission.csv", index=False)

You can now submit your predictions to this competition and see where you stand on the leaderboard.

## If you like my work, do upvote 👆 it and share it with others.
