# Table of contents
* Introduction
* Importing Necessary Libraries
* Loading The Dataset
* Cleaning The Text Data
* Convert Text To Machine Readable Form
* Model Creation
* Checking Accuracy
* Output Prediction
* Submission File

# Introduction

In [None]:
from IPython.display import Image

Image("../input/twitterimg/images.png", width = "400px")



### Twitter has become an important communication channel in times of emergency.
### The ubiquity of smartphones enables people to announce an emergency they're observing in real time. Because of this, more agencies are interested in programmatically monitoring Twitter (i.e. disaster relief organizations and news agencies).
### Our task in this project is to predict whether a tweet is about a real disaster or not. I will explain everything in a simple way.

# Importing Necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

# Loading The Dataset

In [None]:
df=pd.read_csv("../input/nlp-getting-started/train.csv")

print(df.head())

In [None]:
df.shape

In [None]:
print(df.isnull().sum())

In [None]:
sns.countplot(x='target',data=df)  # pass the column by keyword; recent seaborn versions reject it positionally

In [None]:
df["keyword"].value_counts()

### The keyword and location columns contain missing values, so we drop them.

In [None]:
data=df.drop(['location','keyword'],axis=1)
data.head()

# Cleaning The Text Data

### Whenever we work with text data, we have to clean it by removing unnecessary symbols and stop words.

In [None]:
# Cleaning the tweets

stop_words = set(stopwords.words('english'))  # build the stop-word set once
lemmatizer = WordNetLemmatizer()              # reuse one lemmatizer for every tweet

corpus = []
for text in data['text']:

  # Replace every character that is not a letter (a-z, A-Z) with a space
  review = re.sub(pattern='[^a-zA-Z]', repl=' ', string=text)

  # Convert the tweet to lower case
  tweets = review.lower()

  # Tokenize the tweet into words
  tweets_words = tweets.split()

  # Remove the stop words
  tweets_words = [word for word in tweets_words if word not in stop_words]

  # Lemmatize the words
  tweets = [lemmatizer.lemmatize(word) for word in tweets_words]

  # Join the lemmatized words back into one string
  tweets = ' '.join(tweets)

  # Add the cleaned tweet to the corpus
  corpus.append(tweets)
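
The cleaning steps above can be tried on a single tweet with a small, self-contained sketch. It uses only the standard library, so the lemmatization step is left out, and both the example tweet and the tiny `TOY_STOP_WORDS` set are made up for illustration (the notebook itself uses NLTK's full English stop-word list).

```python
import re

# Hypothetical, deliberately tiny stop-word list used only for this illustration;
# the notebook itself uses NLTK's full English list.
TOY_STOP_WORDS = {"the", "is", "a", "in", "on", "of", "and", "to"}

def clean_tweet(text, stop_words=TOY_STOP_WORDS):
    """Return a lower-cased, letters-only version of text without stop words."""
    # Replace every non-letter character with a space
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    # Lower-case the text and split it on whitespace
    words = letters_only.lower().split()
    # Drop the stop words and join the remaining words back into one string
    return " ".join(word for word in words if word not in stop_words)

# Made-up example tweet
example = "There is a forest fire near La Ronge, Canada! #wildfire 2015"
cleaned = clean_tweet(example)
print(cleaned)
```

Digits, punctuation and the `#` sign all become spaces, which is why hashtags survive only as plain words.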

### Lemmatization:
### Why do we need lemmatization?
### Text data is unstructured and noisy, so we have to process it before a model can use it. This process of reducing noise and variation is called normalization.
### Lemmatization is one of the text normalization techniques. In lemmatization, each word is replaced by its dictionary base form (its lemma).
### E.g. "walking" is replaced by "walk" ("walk" is the base form of "walking"). Note that NLTK's WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so verb forms like this one are only reduced when pos='v' is given.

### Our cleaned text data:

In [None]:
corpus[:5]

# Convert Text To Machine Readable Form

### Every tweet will be converted into machine-readable form

In [None]:
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
y = data['target']
print(X.shape)

### Every text will be converted like this:

![](https://cdn.analyticsvidhya.com/wp-content/uploads/2020/05/Screenshot-from-2020-05-21-12-46-42.png)
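
To make the bag-of-words idea concrete, here is a minimal standard-library sketch of the same kind of counting. It is a simplified stand-in for `CountVectorizer`, not its actual implementation (for example, it does not apply scikit-learn's default token pattern), and the three short documents are invented for illustration.

```python
from collections import Counter

# Hypothetical cleaned tweets, used only for illustration
docs = ["forest fire near town", "fire fire everywhere", "lovely sunny day"]

# Vocabulary: every distinct word, sorted alphabetically and mapped to a column index
all_words = sorted({word for doc in docs for word in doc.split()})
vocabulary = {word: index for index, word in enumerate(all_words)}

def to_count_vector(doc, vocabulary):
    """Return a list holding one word count per vocabulary entry."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocabulary]

# One row per document, one column per vocabulary word
matrix = [to_count_vector(doc, vocabulary) for doc in docs]

print(list(vocabulary))
for row in matrix:
    print(row)
```

Each row has the same length as the vocabulary, and most entries are zero, which is why the real matrix for thousands of tweets is normally stored in sparse form.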

In [None]:
print(X)

# Model Creation

### Splitting data into training part and testing part

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [None]:
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)


# Predicting the Test set results
y_pred = classifier.predict(X_test)
print(y_pred)
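
For readers curious about what the classifier does internally, the following is a from-scratch sketch of the idea behind multinomial Naive Bayes with add-one (Laplace) smoothing. It is not scikit-learn's implementation; the function names `train_nb` and `predict_nb` and the four training sentences are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods for each class."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())

    model = {}
    for label in docs_per_class:
        total_words = sum(word_counts[label].values())
        log_prior = math.log(docs_per_class[label] / len(docs))
        # Additive smoothing keeps unseen (class, word) pairs from getting probability 0
        log_likelihood = {
            word: math.log((word_counts[label][word] + alpha)
                           / (total_words + alpha * len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_likelihood)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior score for doc."""
    scores = {}
    for label, (log_prior, log_likelihood) in model.items():
        # Words outside the training vocabulary are ignored
        scores[label] = log_prior + sum(
            log_likelihood[word] for word in doc.split() if word in log_likelihood
        )
    return max(scores, key=scores.get)

# Made-up training data: 1 = disaster, 0 = not a disaster
train_docs = ["forest fire evacuation", "earthquake damage city",
              "lovely sunny day", "great movie tonight"]
train_labels = [1, 1, 0, 0]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "fire damage"))
print(predict_nb(model, "sunny movie"))
```

Working with log probabilities avoids numerical underflow when many small probabilities are multiplied together.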

# Checking Accuracy

In [None]:
cm=confusion_matrix(y_test,y_pred)  # rows are true labels, columns are predicted labels
print("confusion_matrix:",cm)


In [None]:
accuracy=accuracy_score(y_test,y_pred )
print("accuracy_score:",accuracy)


In [None]:

print(classification_report(y_test,y_pred ))
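
The numbers reported above all derive from the four cells of the confusion matrix. As a rough sketch of that relationship, the snippet below computes them by hand for a small set of invented labels (`y_true` and `y_hat` are hypothetical and unrelated to the notebook's data).

```python
# Hypothetical true and predicted labels (1 = disaster, 0 = not a disaster)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

# Same layout as scikit-learn's binary confusion matrix: [[tn, fp], [fn, tp]]
matrix = [[tn, fp], [fn, tp]]

accuracy = (tp + tn) / len(pairs)
precision = tp / (tp + fp)  # share of predicted disasters that are real
recall = tp / (tp + fn)     # share of real disasters that were found
f1 = 2 * precision * recall / (precision + recall)

print(matrix)
print(accuracy, precision, recall, f1)
```

Accuracy alone can be misleading when the classes are imbalanced, which is why precision, recall and F1 are worth reading alongside it.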

### We have to do the same thing for the test dataset

In [None]:
data_test=pd.read_csv("../input/nlp-getting-started/test.csv")
print(data_test.head())

In [None]:
data_test.isnull().sum()


In [None]:
data_tes=data_test.drop(['keyword','location'],axis=1)
data_tes.shape


### Cleaning text in testing dataset

In [None]:

stop_words = set(stopwords.words('english'))  # build the stop-word set once
lemmatizer = WordNetLemmatizer()

corpus1 = []  # predicted label for every test tweet
for text in data_tes['text']:

  # Replace every character that is not a letter with a space
  review = re.sub(pattern='[^a-zA-Z]', repl=' ', string=text)

  # Convert the tweet to lower case
  tweets = review.lower()

  # Tokenize the lower-cased tweet into words
  tweets_words = tweets.split()

  # Remove the stop words
  tweets_words = [word for word in tweets_words if word not in stop_words]

  # Lemmatize the words
  tweets = [lemmatizer.lemmatize(word) for word in tweets_words]

  # Join the lemmatized words back into one string
  tweets = ' '.join(tweets)

  # Vectorize the cleaned tweet (not the raw text) with the vocabulary fitted on the training data
  features = cv.transform([tweets]).toarray()
  pre = classifier.predict(features)
  corpus1.append(pre[0])  # keep the scalar label, not a one-element array

print(len(corpus1))


# Submission File

In [None]:

# Create a submission dataframe and add the relevant columns
submission = pd.DataFrame()

submission['id'] = data_tes['id']
submission['target'] = corpus1

In [None]:
# Let's convert the 'target' column of our submission dataframe to ints
submission['target'] = submission['target'].astype(int)
print('Converted target column to integers.')

print(submission.head())




# Are our test and submission dataframes the same length?
if len(submission) == len(data_tes):
    print("Submission dataframe is the same length as test ({} rows).".format(len(submission)))
else:
    print("Dataframes mismatched, won't be able to submit to Kaggle.")

# Save the submission dataframe as a CSV file
# for the Kaggle submission
submission.to_csv('../submission_nlp1.csv', index=False)
print('Submission CSV is ready!')

# If this notebook is useful to you, please upvote it!