# NLP with disaster tweets

### Twitter is an American microblogging and social networking service on which users post and interact with messages known as "tweets". 

### Here we are predicting whether a given tweet is about a real disaster or not.

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

# Importing Libraries

We import nltk, NumPy, pandas, and the plotting libraries Matplotlib and seaborn; scikit-learn is imported later, in the machine learning section.
The Natural Language Toolkit (NLTK) is one of the best-known and most widely used NLP libraries, useful for tasks such as tokenization, stemming, tagging, and parsing.

In [None]:
import pandas as pd 
import numpy as np
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer,PorterStemmer
from nltk.corpus import stopwords
import re
import matplotlib.pyplot as plt
import seaborn as sns
import string

# Loading dataset

There are three files: train.csv, test.csv, and sample_submission.csv.

Each sample in the train and test set has the following information:

* The text of a tweet
* A keyword from that tweet (although this may be blank!)
* The location the tweet was sent from (may also be blank)

### Columns
* id - a unique identifier for each tweet
* text - the text of the tweet
* location - the location the tweet was sent from (may be blank)
* keyword - a particular keyword from the tweet (may be blank)
* target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

The dataset contains 10,000 tweets that are hand classified.

In [None]:
train_df=pd.read_csv('../input/nlp-getting-started/train.csv')
test_df=pd.read_csv('../input/nlp-getting-started/test.csv')
sample_submission_df=pd.read_csv('../input/nlp-getting-started/sample_submission.csv')
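
Since `keyword` and `location` may be blank, it can help to count missing values right after loading. The sketch below uses a tiny made-up frame with the same column names (the rows are invented for illustration, not taken from the dataset); in the notebook the same calls would apply to `train_df`.

```python
import pandas as pd

# Tiny stand-in for train_df; the rows are invented for illustration only
sample_df = pd.DataFrame({
    'id': [1, 2, 3],
    'keyword': ['fire', None, 'flood'],
    'location': [None, None, 'Texas'],
    'text': ['Forest fire near town', 'What a sunny day', 'Flood warning issued'],
    'target': [1, 0, 1],
})

# Number of blank (NaN) entries per column
missing_counts = sample_df.isnull().sum()
print(missing_counts)

# Share of each class in the target column
class_share = sample_df['target'].value_counts(normalize=True)
print(class_share)
```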

## Visualization

In [None]:
target=train_df['target'].value_counts()
# Pass x and y by keyword; recent seaborn versions do not accept them positionally
sns.barplot(x=target.index,y=target.values,edgecolor=(0,0,0),linewidth=1.5)
plt.title('Comparing disaster tweets and non-disaster tweets',fontsize=15)
plt.xticks(fontsize=20)
plt.ylabel('Samples',fontsize=15)

In [None]:
keyword=train_df['keyword'].value_counts()[:20]
plt.figure(figsize=(10,7))
sns.barplot(x=keyword.index,y=keyword.values,edgecolor=(0,0,0),linewidth=2)
plt.title('Top 20 keywords',fontsize=20)
plt.xticks(fontsize=20,rotation=270)
plt.yticks(fontsize=20)
plt.xlabel('Keywords',fontsize=20,color='blue')

In [None]:
location=train_df['location'].value_counts()[:20]
plt.figure(figsize=(10,7))
sns.barplot(x=location.index,y=location.values,edgecolor=(0,0,0),linewidth=2)
plt.title('Top 20 locations',fontsize=20)
plt.xticks(fontsize=20,rotation=270)
plt.yticks(fontsize=20)
plt.xlabel('Locations',fontsize=20,color='blue')

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
tweet_len=train_df[train_df['target']==1]['text'].str.len()
ax1.hist(tweet_len,color='red')
ax1.set_title('disaster tweets')
tweet_len=train_df[train_df['target']==0]['text'].str.len()
ax2.hist(tweet_len,color='blue')
ax2.set_title('Not disaster tweets')
fig.suptitle('Characters in tweets',fontsize=20)

plt.show()

In [None]:
fig,(ax1,ax2)=plt.subplots(1,2,figsize=(10,5))
tweet_len=train_df[train_df['target']==1]['text'].str.split().map(lambda x: len(x))
ax1.hist(tweet_len,color='red')
ax1.set_title('disaster tweets')
tweet_len=train_df[train_df['target']==0]['text'].str.split().map(lambda x: len(x))
ax2.hist(tweet_len,color='blue')
ax2.set_title('Not disaster tweets')
fig.suptitle('Words in tweets',fontsize=20)
plt.show()

# Text Processing

Text processing is one of the most common tasks in ML applications.

The text column in our dataset contains hyperlinks, punctuation, stop words, and numbers, so we remove them with the text processing steps below.

## Convert text to lowercase


In [None]:
def lower(words):
    return words.lower()
train_df['text']=train_df['text'].apply(lambda x:lower(x))


## Remove numbers

In [None]:
def remove_numbers(words):
    return re.sub(r'\d+','',words)
train_df['text']=train_df['text'].apply(lambda x: remove_numbers(x))


## Remove punctuation

In [None]:
def remove_punctuation(words):
    table=str.maketrans('','',string.punctuation)
    return words.translate(table)
train_df['text']=train_df['text'].apply(lambda x: remove_punctuation(x))


## Tokenization

Tokenization is usually one of the first steps in NLP. It is the process of breaking a string into tokens, which are smaller units such as words. Here each tweet is split into a list of word tokens, and the later steps (stop word removal and lemmatization) operate on those lists.

In [None]:
train_df['text']=train_df['text'].apply(lambda x:word_tokenize(x))
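
As a rough illustration of what tokenization produces, the sketch below splits an already cleaned tweet into word tokens with a regular expression. This is a simplified stand-in, not NLTK's `word_tokenize`, which treats contractions and punctuation with more care; `simple_tokenize` and the sample sentence are made up for the example.

```python
import re

def simple_tokenize(text):
    # Hypothetical helper: return runs of word characters as tokens
    return re.findall(r'\w+', text)

tokens = simple_tokenize('forest fire near la ronge sask canada')
print(tokens)
```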


## Removing Stop words

Stop words are the most common words in a language, such as “the”, “a”, “at”, “for”, “above”, “on”, “is”, and “all”. They carry little meaning on their own and are usually removed from text. We can remove them using NLTK's English stop word list.

In [None]:
# Build the stop word set once instead of on every call
stop_words=set(stopwords.words('english'))
def remove_stopwords(words):
    return [word for word in words if word not in stop_words]
train_df['text']=train_df['text'].apply(lambda x: remove_stopwords(x))

## Removing links

In [None]:
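def remove_links(words):
    # Punctuation and digits were stripped earlier, so a URL is now a single
    # token such as 'httptcoabc'; drop tokens that start with 'http'.
    return [word for word in words if not word.startswith('http')]
train_df['text']=train_df['text'].apply(lambda x:remove_links(x))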
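
Because punctuation is stripped before this step, a URL no longer contains `://` by the time links are handled. One alternative, sketched here on a single made-up raw string, is to remove links first and only then strip punctuation. The sample tweet and variable names are assumptions for illustration.

```python
import re
import string

raw = 'Flood in the city http://t.co/abc123 stay safe!'

# Stripping punctuation first fuses the URL into one token without '://'
table = str.maketrans('', '', string.punctuation)
stripped_first = raw.translate(table)

# Removing the link on the raw string, before any other cleaning
url_pattern = re.compile(r'https?://\S+')
links_first = url_pattern.sub('', raw)
links_first = ' '.join(links_first.translate(table).split())

print(stripped_first)
print(links_first)
```

With this ordering the URL pattern still matches, and no leftover link fragments reach the vectorizer.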
                 

## Stemming 

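Stemming reduces a word to its stem by stripping suffixes; the result is not always a dictionary word. The stemming step is left commented out here because lemmatization is used instead.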

In [None]:
# def stemming(words):
#     ps=PorterStemmer()
#     return [ps.stem(word) for word in words]
# train_df['text']=train_df['text'].apply(lambda x: stemming(x))


## Lemmatizing

Lemmatization also converts a word to its base form. The difference between stemming and lemmatization is that lemmatization looks the word up in a vocabulary (here WordNet) and returns a valid dictionary form, the lemma, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors. Note that `WordNetLemmatizer.lemmatize` treats every word as a noun unless a part-of-speech tag is passed.

In [None]:
def lemmatizing(words):
    lemmatizer =WordNetLemmatizer()
    return [lemmatizer.lemmatize(word) for word in words]
train_df['text']=train_df['text'].apply(lambda x: lemmatizing(x))
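
To see the suffix stripping that stemming performs, the short sketch below runs NLTK's `PorterStemmer` on a few made-up words. It needs no extra NLTK data downloads. Note how "studies" becomes "studi", which is not a dictionary word.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
words = ['studies', 'flooding', 'fires']

# Stem each word independently; no context or dictionary is involved
stems = [ps.stem(word) for word in words]
print(list(zip(words, stems)))
```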


In [None]:
def final_text(words):
    return ' '.join(words)
train_df['text']=train_df['text'].apply(lambda x:final_text(x))
   

## Machine Learning 

The words need to be encoded as integers or floating-point values before they can be used as input to a machine learning algorithm. This step is called feature extraction (or vectorization).

The scikit-learn library offers easy-to-use tools to perform both tokenization and feature extraction of your text data.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vect=TfidfVectorizer(min_df=2,max_features=None,analyzer="word",
                     ngram_range=(1,3)  # (1,6)
                     ).fit(train_df['text'])
x_train_vect=vect.transform(train_df['text'])
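
To make the effect of `min_df` and `ngram_range` concrete, here is a small sketch on three made-up sentences. It uses `ngram_range=(1, 2)` rather than the `(1, 3)` above only to keep the vocabulary short; the documents and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'forest fire near town',
    'forest fire spreading',
    'nice weather today',
]

# min_df=2 keeps only terms found in at least two documents;
# ngram_range=(1, 2) considers single words and two-word phrases
toy_vect = TfidfVectorizer(min_df=2, ngram_range=(1, 2)).fit(docs)
vocab = sorted(toy_vect.vocabulary_)
print(vocab)

toy_matrix = toy_vect.transform(docs)
print(toy_matrix.shape)  # (number of documents, number of kept terms)
```

Only "fire", "forest", and the bigram "forest fire" appear in two documents, so they are the only features kept. The third sentence shares no term with the vocabulary and becomes an all-zero row.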


In [None]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.linear_model import LogisticRegression
model=LogisticRegression()
model.fit(x_train_vect,train_df['target'])
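
The model above is fitted on the whole training set, so the notebook has no estimate of how well it generalizes. One way to get such an estimate is k-fold cross-validation, sketched below. The toy texts, `cv=2`, and the F1 scorer are assumptions for illustration; on the real data one would pass `train_df['text']` and `train_df['target']` and use a larger `cv`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented toy data: 1 = disaster, 0 = not a disaster
texts = [
    'forest fire spreading fast', 'flood water rising in town',
    'earthquake damage reported', 'wildfire evacuation ordered',
    'love this sunny weather', 'great movie last night',
    'my exam was a disaster lol', 'coffee with friends today',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline refits the vectorizer inside each fold, so held-out
# tweets do not influence the vocabulary or the IDF weights
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
scores = cross_val_score(pipeline, texts, labels, cv=2, scoring='f1')
print(scores.mean())
```

Wrapping the vectorizer and the classifier in one pipeline also means the same object can later be fitted once on all training tweets and used for prediction.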



## Predictions

In [None]:
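# Clean the test tweets with the same steps as the training text before vectorizing
test_text=test_df['text'].apply(lambda x: final_text(lemmatizing(remove_links(remove_stopwords(word_tokenize(remove_punctuation(remove_numbers(lower(x)))))))))
predictions=model.predict(vect.transform(test_text))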

## Submission

In [None]:

sample_submission_df['target']=predictions
sample_submission_df.to_csv('submission.csv',index=False)
