# Disaster tweet prediction

Hey everyone! This is a beginner project in **natural language processing**!
In this notebook I'll go step by step through solving this classification problem, submit the result to the competition to check the accuracy, and then try to improve it!
Happy learning!

Let's start by importing the necessary libraries!

In [None]:
import numpy as np
import pandas as pd
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords # common words that carry little meaning, e.g. "a", "the", "an"
from nltk.stem.porter import PorterStemmer # reduces a word to its stem, e.g. loved --> love
from sklearn.feature_extraction.text import CountVectorizer # turns text into vectors of word counts
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import svm
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

Let's look at our stop words!

In [None]:
print(stopwords.words('english')) # These words don't really give us much info

Let's load the dataset!

In [None]:
tweet_dataset = pd.read_csv('../input/nlp-getting-started/train.csv')
tweet_dataset_test = pd.read_csv('../input/nlp-getting-started/test.csv')
tweet_dataset.shape

Let's check the features in the training set!

In [None]:
tweet_dataset.head()

In [None]:
tweet_dataset.info()

Let's visualize the number of tweets in each class:
* 0 --> Not an actual disaster!
* 1 --> Actual disaster!

In [None]:
plt.figure(figsize=(8,5))
sns.countplot(x=tweet_dataset['target'])
plt.title('Target vs count',fontsize=20)
plt.xlabel('Target',fontsize=15)
plt.ylabel('Count',fontsize=15);

Let's check for any missing data!

In [None]:
tweet_dataset.isnull().sum()

Alright! We're going to use the keyword, location and text columns together for the analysis and see whether that gives us better accuracy!
First, let's fill the **null values** with the empty string **''**.


In [None]:
tweet_dataset = tweet_dataset.fillna('')

Let's add a new column named context with the required text combined!

In [None]:
tweet_dataset['context'] =  tweet_dataset['text'] + " " + tweet_dataset['location'] + ' ' + tweet_dataset['keyword']
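
To see why the nulls are filled before the columns are combined, here is a minimal sketch on a made-up two-row frame (the rows are illustrative, not taken from the dataset). In pandas, concatenating a string with a missing value gives a missing value, so a tweet without a location or keyword would lose its whole context.

```python
import numpy as np
import pandas as pd

# Toy frame with the same three columns; the second row has no location or keyword.
toy = pd.DataFrame({
    "text": ["Forest fire near town", "I love fruits"],
    "location": ["Canada", np.nan],
    "keyword": ["fire", np.nan],
})

# Without fillna: any missing piece makes the combined value missing.
raw_context = toy["text"] + " " + toy["location"] + " " + toy["keyword"]
print(raw_context.isna().tolist())

# With fillna(''): every row keeps its text (plus a few harmless extra spaces).
filled = toy.fillna("")
context = filled["text"] + " " + filled["location"] + " " + filled["keyword"]
print(context.tolist())
```

The extra spaces left behind by empty fields don't matter here, because the cleaning step later splits the text on whitespace.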

In [None]:
tweet_dataset.head()

Now let's create a function that cleans the context column we created, removes the stopwords, and stems the remaining words!

In [None]:
stem = PorterStemmer() # stemmer object; stemming reduces a word to its root, e.g. loved --> love

In [None]:
# now let's create a function to preprocess a cell and then apply it to the entire feature!
stop_words = set(stopwords.words('english')) # build the set once instead of reloading the list for every word
def stemming(content):
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content) # replace everything except letters with a space, e.g. apple,banana --> apple banana
    stemmed_content = stemmed_content.lower() # make all text lower case
    stemmed_content = stemmed_content.split() # split the line into words on whitespace
    stemmed_content = [stem.stem(word) for word in stemmed_content if word not in stop_words] # drop the stopwords and stem the remaining words
    stemmed_content = ' '.join(stemmed_content) # join the words back into one cleaned sentence
    return stemmed_content
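
To make the cleaning steps easier to follow, here is a simplified sketch that uses only the standard library. The tiny `TOY_STOPWORDS` set and the `clean` helper are hypothetical stand-ins for illustration (the notebook uses NLTK's English stopword list), and stemming is left out so the example runs without NLTK.

```python
import re

# Hypothetical, tiny stopword set for illustration only.
TOY_STOPWORDS = {"a", "an", "the", "is", "in", "on", "and", "of", "to"}

def clean(text):
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)  # digits, punctuation, '#' become spaces
    tokens = letters_only.lower().split()           # lower-case, then split on whitespace
    return " ".join(t for t in tokens if t not in TOY_STOPWORDS)

print(clean("The fire is in the forest!!"))
print(clean("Forest fire near La Ronge, Sask. #Canada 2015!"))
```

Note that the regex also throws away digits and hashtag symbols, so "2015" disappears and "#Canada" becomes just "canada".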

Let's apply the stemming function on our column context

In [None]:
# let's apply the function on our feature context
tweet_dataset['context'] = tweet_dataset['context'].apply(stemming)

Let's split our training data into text and labels so we can train our classifier model!

In [None]:
X = tweet_dataset['context'].values
y = tweet_dataset['target'].values

Now that we have our text, we have to vectorize it before we can feed it to our classifier model! The model can't learn from raw strings, so we convert the cleaned text into numeric vectors. `CountVectorizer` builds a vocabulary from the text and represents each tweet as a vector of word counts.

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X)
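
As a small illustration of what the vectorizer produces, here is a sketch on two made-up sentences (the `demo_` names and sentences are just for this example). Each column corresponds to one word of the learned vocabulary, and each row holds the counts for one document.

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_docs = ["fire in the forest", "forest fire fire"]
demo_vectorizer = CountVectorizer()
demo_counts = demo_vectorizer.fit_transform(demo_docs)  # sparse matrix, one row per document

# vocabulary_ maps each word to its column index
print(sorted(demo_vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))
# dense view of the counts; fine for a toy example, too large for the real data
print(demo_counts.toarray())
```

The result is stored as a sparse matrix because most tweets contain only a handful of the vocabulary's words.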

Now that we have X and y, let's train a classifier on them!

In [None]:
#classifier = LogisticRegression()
#classifier = svm.SVC(kernel = 'linear')
#classifier = KNeighborsClassifier(n_neighbors = 10,weights = 'uniform', metric = 'minkowski' , p=2)
classifier = svm.SVC(kernel = 'rbf', C=1, gamma =0.1)
#classifier = GaussianNB() # needs a dense array: pass X.toarray() instead of the sparse X
#classifier = DecisionTreeClassifier(criterion='entropy')
#classifier = RandomForestClassifier(criterion = 'entropy', max_depth = 8, max_features  = 'sqrt', n_estimators= 10)
#from xgboost import XGBClassifier
#classifier = XGBClassifier(use_label_encoder=False, eval_metric = 'error')
classifier.fit(X, y)

Now let's analyse our model and evaluate its accuracy! We got excellent accuracy on the training data with the logistic regression classifier, but the model has already seen that data, so a high score there may just mean overfitting. To avoid leaving it to chance, let's check the accuracy with 10-fold cross-validation and tune the hyperparameters if possible!

In [None]:
from sklearn.model_selection import cross_val_score

# accuracy score on training data
y_pred_train = classifier.predict(X)
accuracy_train = accuracy_score(y, y_pred_train)
# accuracy from 10-fold cross-validation
accuracies = cross_val_score(estimator=classifier, X=X, y=y, cv=10)

print("-------------------------------")
print("Accuracy score on training data: ", accuracy_train)
print("Cross-validation accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))
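
One thing to be aware of: the vectorizer was fit on the whole training set before cross-validation, so each fold's vocabulary already includes words from its held-out tweets. A possible alternative (my suggestion, not what this notebook does) is to wrap the vectorizer and the classifier in a scikit-learn pipeline, so the vocabulary is rebuilt from the training part of every fold. Here is a minimal sketch on made-up tweets:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up examples, only to show the mechanics.
toy_texts = [
    "forest fire spreading fast", "flood destroys bridge",
    "earthquake shakes city", "wildfire evacuation ordered",
    "i love this song", "great pizza tonight",
    "my exam was a disaster lol", "watching a movie",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline refits CountVectorizer inside each fold, on that fold's training texts only.
toy_pipeline = make_pipeline(CountVectorizer(), LogisticRegression())
toy_scores = cross_val_score(toy_pipeline, toy_texts, toy_labels, cv=2)
print(toy_scores)
```

With only eight sentences the scores themselves mean nothing; the point is that raw text goes into `cross_val_score` and the vectorizing happens per fold.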

In [None]:
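from sklearn.metrics import confusion_matrix
# note: this uses the predictions on the training data, so it looks more optimistic than cross-validation
cf_matrix = confusion_matrix(y, y_pred_train)
print(cf_matrix)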

In [None]:
# visualizing the confusion matrix as percentages!
sns.heatmap(cf_matrix/np.sum(cf_matrix), annot=True, 
            fmt='.2%', cmap='Blues')

In [None]:
# visualizing with labels!
labels = ['True Neg','False Pos','False Neg','True Pos']
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')
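
The label order used above follows scikit-learn's layout for binary labels 0 and 1: rows are the true classes and columns are the predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. Here is a small sketch on made-up labels that unpacks the four cells and derives precision, recall and F1 from them:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, only to show the layout.
toy_true = [0, 0, 1, 1, 1]
toy_pred = [0, 1, 1, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

precision = tp / (tp + fp)  # of the tweets predicted as disasters, how many really were
recall = tp / (tp + fn)     # of the real disasters, how many we caught
f1 = 2 * precision * recall / (precision + recall)
print(tn, fp, fn, tp, precision, recall, f1)
```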

So, as we can see, the accuracies are:
* Logistic regression - 56%
* SVM (linear kernel) - 54%
* KNN - 56%
* SVM (RBF kernel) - 57%
* Naive Bayes - 52%
* Decision tree - 52%
* Random forest - 58.56%
* XGBoost - 55%

Let's try to improve the accuracy!

In [None]:
# from sklearn.model_selection import GridSearchCV
# parameters = [{'C':[0.25,0.5,0.75,1], 'kernel' : ['linear']},
#               {'C':[0.25,0.5,0.75,1], 'kernel' : ['rbf'], 'gamma' : [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]}]
# grid_search = GridSearchCV(estimator=classifier,
#                           param_grid=parameters,
#                           scoring='accuracy',
#                           cv=10)
# grid_search.fit(X,y)
# print("Best Accuracy: {:.2f} %".format(grid_search.best_score_*100))
# print("Best Parameters: ", grid_search.best_params_)
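
The grid above has 4 linear combinations plus 4 x 9 = 36 RBF combinations, and with `cv=10` that means 40 x 10 = 400 model fits, which can take a long time for an SVM on the full training set. To try the mechanics quickly first, here is a scaled-down sketch on synthetic data (the `toy_` names, the dataset and the smaller grid are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic dataset, so the search finishes in seconds.
toy_X, toy_y = make_classification(n_samples=60, n_features=5, random_state=0)

# Same structure as the grid above, with fewer values: 2 + 2*2 = 6 combinations.
toy_grid = [
    {"C": [0.5, 1], "kernel": ["linear"]},
    {"C": [0.5, 1], "kernel": ["rbf"], "gamma": [0.1, 0.5]},
]
toy_search = GridSearchCV(estimator=SVC(), param_grid=toy_grid,
                          scoring="accuracy", cv=3)
toy_search.fit(toy_X, toy_y)

print("Best Accuracy: {:.2f} %".format(toy_search.best_score_ * 100))
print("Best Parameters: ", toy_search.best_params_)
```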

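Let's run our trained classifier on the test set! The test set has no labels, so we can only check the accuracy by submitting the predictions to the competition. (Random forest scored best above; to use it here, uncomment its line in the training cell and retrain.)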

In [None]:
tweet_dataset_test.isnull().sum()
tweet_dataset_test =  tweet_dataset_test.fillna('')
tweet_dataset_test['context'] =  tweet_dataset_test['text'] + " " + tweet_dataset_test['location'] + ' ' + tweet_dataset_test['keyword']
tweet_dataset_test.head()

In [None]:
tweet_dataset_test['context'] = tweet_dataset_test['context'].apply(stemming)
tweet_dataset_test.head()

In [None]:
X_test = tweet_dataset_test['context']
X_test = vectorizer.transform(X_test)
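
Notice that the test set goes through `transform`, not `fit_transform`: the vocabulary learned from the training text is reused, so the test matrix has the same columns the classifier was trained on. Words that never appeared in training are simply ignored. A minimal sketch with made-up sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

demo_vectorizer = CountVectorizer()
demo_train = demo_vectorizer.fit_transform(["forest fire", "flood warning"])

# "tornado" was never seen during fit, so it gets no column and is dropped.
demo_test = demo_vectorizer.transform(["fire tornado"])

print(demo_train.shape, demo_test.shape)
print(demo_test.toarray())
```

Calling `fit_transform` on the test text instead would build a different vocabulary, and the classifier's predictions would no longer line up with the right columns.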

In [None]:
y_pred_test = classifier.predict(X_test)

Let's create our submission file!

In [None]:
results = pd.DataFrame(tweet_dataset_test['id'],columns=['id'])
results['target'] = y_pred_test
results.shape


In [None]:
tweet_dataset_test.shape


In [None]:
results.to_csv('Results.csv',index=False)