# Natural Language Processing

Here we work with the 'Real or Not? NLP with Disaster Tweets' dataset using NLP. The data contains the keyword, location, target and text of a set of tweets. The challenge is to predict whether each of the 'test' tweets is about a disaster or not.

In this notebook I show the methods I used to get a score of 0.79252.

**If you liked this notebook or found something useful in it, please give it an upvote!**

**If you have ideas to improve the notebook, please tell me in the comments.**



## **Imports**

In [None]:
import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
 
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

import re

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score

import warnings
warnings.filterwarnings('ignore')

In [None]:
# The data comes split into two parts, train and test
# The test data is what we want to predict; to do that, we fit a model on the train data
train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')

In [None]:
# Data shape (rows, columns)
train.shape, test.shape

## Data cleaning

Here we clean the data. In this case, that means checking for missing values and processing the tweet text.

In [None]:
data = pd.concat([train, test], axis=0) # Concatenate the data 

In [None]:
data.head()

In [None]:
data.isnull().sum()

Here we see the number of missing values in each column.
So let's fill them in (except for the target, for now).

In [None]:
# Convert all the strings to lowercase
# This makes the data easier to process
data['text'] = data['text'].str.lower()
data['keyword'] = data['keyword'].str.lower()
data['location'] = data['location'].str.lower()

In [None]:
# Set the row index to the tweet id
data.set_index('id', inplace=True)

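This is an important step for this type of prediction.
Here we use a lambda function with a regular expression to strip @mentions, URLs, a leading "rt" (retweet marker) and any character that is not a letter, digit, space or tab.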

In [None]:
data['text'] = data['text'].apply(lambda x: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "",x))
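To make the effect of this pattern concrete, here is a small sketch that applies the same regular expression to one made-up tweet. The sample text and the names `PATTERN` and `sample` are assumptions for illustration, not rows from the dataset.

```python
import re

# Same pattern as in the cell above
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?"

# Hypothetical, already lowercased tweet
sample = "rt @user123: forest fire near la ronge! http://t.co/abc #wildfire"

cleaned = re.sub(PATTERN, "", sample)

# The removed pieces leave extra spaces behind, so split on whitespace
tokens = cleaned.split()
print(tokens)
# ['forest', 'fire', 'near', 'la', 'ronge', 'wildfire']
```

Note that the hashtag word itself is kept; only the `#` symbol is removed.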

### STOPWORD
This function removes very common words (stop words) that carry little meaning and can harm the model.
Examples: is, he, has...


In [None]:
def remove_words(data, col):
    # A set gives fast membership tests
    stop = set(stopwords.words('english'))
    # Split each text into words and keep only the non-stop words
    data[col] = data[col].str.split().apply(
        lambda words: [word for word in words if word not in stop])

In [None]:
remove_words(data,'text')
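As a rough illustration of what stop-word removal does to a single sentence, the sketch below uses a tiny hand-written stop list. `STOP_WORDS` and `drop_stop_words` are hypothetical names, and the list is only an assumption standing in for NLTK's much longer English list.

```python
# Hypothetical mini stop list; the notebook uses stopwords.words('english')
STOP_WORDS = {"is", "he", "has", "a", "the", "in"}

def drop_stop_words(text, stop_words=STOP_WORDS):
    """Split text on whitespace and drop every word found in stop_words."""
    return [word for word in text.split() if word not in stop_words]

tokens = drop_stop_words("he has seen the fire in a forest")
print(tokens)
# ['seen', 'fire', 'forest']
```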

### Lemmatizer
This function reduces each word to its base form (lemma), treating every word as a verb.
Example: running -> run

In [None]:
def get_word_variation(data, col):
    lemmatizer = WordNetLemmatizer()
    # Lemmatize every word as a verb ('v')
    data[col] = data[col].apply(
        lambda words: [lemmatizer.lemmatize(word, 'v') for word in words])

In [None]:
get_word_variation(data, 'text')

In [None]:
data.loc[data['keyword'].notnull()].head(10)

Looking at this data, you can see that some locations and keywords also appear in the text. So let's use the text to fill in the missing locations and keywords.

In [None]:
uniq_keyword = list(data['keyword'].unique())
uniq_location = list(data['location'].unique())

In [None]:
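# Rows whose keyword is missing (computed once, not on every iteration)
missing_keyword = data.index[data['keyword'].isnull()]
for idx in missing_keyword:
    for n in data.at[idx, 'text']:
        if n in uniq_keyword:
            # .at avoids chained assignment; the last match wins
            data.at[idx, 'keyword'] = n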
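The idea behind this fill can be sketched on a toy frame. Everything here (`toy`, `known_keywords`, `last_known`, the three rows) is made up for illustration and assumes the text column already holds lists of words, as it does at this point in the notebook.

```python
import pandas as pd

# Hypothetical toy data: one row has a keyword, two rows do not
toy = pd.DataFrame(
    {
        "keyword": ["flood", None, None],
        "text": [["flood", "warn"], ["huge", "flood", "downtown"], ["nice", "day"]],
    },
    index=[10, 11, 12],
)

# Keywords that already appear somewhere in the column
known_keywords = set(toy["keyword"].dropna())

def last_known(tokens):
    """Return the last token that is a known keyword, or None if there is none."""
    matches = [t for t in tokens if t in known_keywords]
    return matches[-1] if matches else None

# Fill only the rows where the keyword is missing
missing = toy["keyword"].isnull()
toy.loc[missing, "keyword"] = toy.loc[missing, "text"].apply(last_known)
print(toy["keyword"].tolist())
```

Rows whose text contains no known keyword stay missing, which is why the notebook later fills the remaining gaps with "None".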

In [None]:
data.isnull().sum()

In [None]:
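# Rows whose location is missing (computed once, not on every iteration)
missing_location = data.index[data['location'].isnull()]
for idx in missing_location:
    for n in data.at[idx, 'text']:
        if n in uniq_location:
            # .at avoids chained assignment; the last match wins
            data.at[idx, 'location'] = n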

In [None]:
data.isnull().sum()

In [None]:
data['location'].unique()

In [None]:
data

In [None]:
data['text'] = data['text'].apply(lambda x: ' '.join(x))

In [None]:
data.isnull().sum()

In [None]:
data.loc[data['keyword'].isnull()]

In [None]:
data.columns

In [None]:
data.head()

In [None]:
data['keyword'] = data['keyword'].fillna("None")
data['location'] = data['location'].fillna("None")
#data['words'] = data['words'].fillna("None")


#data.drop(['keyword', 'location', 'words'], axis=1, inplace=True)

In [None]:
data.head()

In [None]:
train = data.loc[data['target'].notnull()]
train

In [None]:
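# Rows without a target are the test set; drop() returns a new frame,
# which avoids modifying a slice of data in place
test = data.loc[data['target'].isnull()].drop('target', axis=1)
test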

In [None]:
train.shape, test.shape

In [None]:
X = train['text']
y = train['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=13)

In [None]:
X_train.shape, y_train.shape

In [None]:
X_train

In [None]:
y_train

In [None]:
sgd = Pipeline([
    ('countVector', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('modelo', SGDClassifier())
])
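To see what the first two pipeline steps produce before the classifier, here is a minimal sketch on a made-up three-document corpus. The corpus and the variable names are assumptions; the only scikit-learn behavior relied on is that `TfidfTransformer` L2-normalizes each row by default.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical mini corpus standing in for the cleaned tweets
corpus = [
    "forest fire near town",
    "love sunny day",
    "fire fire evacuation order",
]

# Step 1: bag-of-words counts, one row per document, one column per word
counts = CountVectorizer().fit_transform(corpus)

# Step 2: reweight the counts by TF-IDF so that frequent words weigh less
tfidf = TfidfTransformer().fit_transform(counts)

# With the default settings every row is scaled to unit L2 norm
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1))).ravel()
print(counts.shape, row_norms)
```

The classifier at the end of the pipeline is then trained on these weighted rows rather than on the raw text.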

In [None]:
sgd.fit(X_train, y_train)

In [None]:
sgd_pred = sgd.predict(X_test)

In [None]:
sgd_score = f1_score(y_test, sgd_pred)

In [None]:
sgd_score
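For readers unfamiliar with the metric, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand on made-up labels; `f1_from_labels`, `y_true_toy` and `y_pred_toy` are hypothetical names, and the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 2 true positives, 1 false positive, 1 false negative
y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 1, 0]

toy_score = f1_from_labels(y_true_toy, y_pred_toy)
print(round(toy_score, 4))
# 0.6667
```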

In [None]:
pred = sgd.predict(test['text'])

In [None]:
test

In [None]:
submission = pd.DataFrame({'id': test.index, 'target': pred})

In [None]:
submission['target'] = submission['target'].astype('int')

In [None]:
submission.to_csv('submission.csv', index=False)