## Why this notebook?
Most of the kernels I saw approached the problem with advanced Deep Learning techniques like **LSTM** and **GRU**, or with state-of-the-art NLP architectures like the **Transformer**. I agree that these methods improve the accuracy and performance of the model, but they are quite hard to grasp for beginners who have just started their NLP journey. So I have tried to keep this notebook as simple as I could.

I have applied basic text preprocessing techniques and used **Word2Vec** for the word embeddings. For training I have used `RandomForestClassifier`.

## Importing the basic libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from wordcloud import WordCloud
from gensim.models import Word2Vec

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

In [None]:
df_train = pd.read_csv("../input/nlp-getting-started/train.csv")
df_test = pd.read_csv("../input/nlp-getting-started/test.csv")

In [None]:
df_train.head()

In [None]:
df_test.head()

## Let's look at the missing values in the dataset

In [None]:
train_na_count = df_train.isna().sum()
sns.barplot(x=train_na_count.values, y=train_na_count.index)

In [None]:
test_na_count = df_test.isna().sum()
sns.barplot(x=test_na_count.values, y=test_na_count.index)
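The raw counts above are hard to compare between the two files because they have different numbers of rows. Here is a minimal sketch of reporting the share of missing values per column instead. The small `toy` frame is made up for illustration and only mimics the column names of the competition data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_train; the values are invented for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "keyword": ["fire", np.nan, "flood", "storm"],
    "location": [np.nan, np.nan, "Texas", np.nan],
    "text": ["a", "b", "c", "d"],
})

# isna().mean() gives the fraction of missing entries in each column.
missing_pct = (toy.isna().mean() * 100).round(1)
print(missing_pct)
```

On the real data you could swap `toy` for `df_train` or `df_test` and plot `missing_pct` the same way as the counts.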

### Both the training and test data have many missing values in the location column. Now let's check how balanced the training set is. The plot below shows that the classes are roughly balanced.

In [None]:
sns.countplot(x=df_train['target'])
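If you prefer numbers over a plot, the class shares can be printed directly. This is only a sketch: the `target` series below is invented and stands in for `df_train['target']`.

```python
import pandas as pd

# Hypothetical labels standing in for df_train['target'] (1 = disaster, 0 = not).
target = pd.Series([0, 0, 0, 1, 1, 0, 1, 0, 1, 0], name="target")

# normalize=True returns proportions instead of raw counts.
class_share = target.value_counts(normalize=True).sort_index()
print(class_share)
```

A split close to 50/50 means plain accuracy is a reasonable first metric; a heavily skewed split would call for something like F1.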

In [None]:
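from wordcloud import STOPWORDS  # built-in stop word set; the NLTK stop word list is only defined further down

def make_wordcloud(text):
    # join all texts into one string and draw the most frequent words
    wordcloud = WordCloud(width=1000, height=1000,
                          background_color='white',
                          stopwords=STOPWORDS,
                          min_font_size=10).generate(" ".join(text.values))
    plt.figure(figsize=(13, 13))
    plt.imshow(wordcloud)
    plt.axis("off")  # hide the axis ticks around the image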

### The word cloud of the text before preprocessing shows that tokens like 'https', 'co', 'like' and 'new' are among the most frequently occurring words.

In [None]:
make_wordcloud(df_train['text'])

In [None]:
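import re

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# needs the NLTK 'stopwords' and 'wordnet' data (see nltk.download) if they are not already installed
wl = WordNetLemmatizer()
STOPWORDS = set(stopwords.words('english'))  # a set makes the membership test in preprocessing fast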

In [None]:
# utility function for preprocessing the texts
def preprocess_text(texts):
    corpus = list()
    for text in texts:
        text = re.sub(r'https?://\S+|www\.\S+', '', text)  # remove any website links present in the text
        text = re.sub(r'[^a-zA-Z]', ' ', text)             # replace every non-alphabetic character with a space
        text = text.lower()
        text = text.split()

        # drop the stop words, then lemmatize the remaining words with the WordNet lemmatizer
        text = [wl.lemmatize(word) for word in text if word not in STOPWORDS]
        text = " ".join(text)

        corpus.append(text)

    return corpus
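To see what the two regular expressions do on their own, here is a small standard-library sketch that runs them on one made-up tweet. It leaves out the stop word removal and lemmatization, since those need the NLTK data; the `tweet` string is invented and not taken from the dataset.

```python
import re

# Invented example tweet, not taken from the dataset.
tweet = "Huge fire downtown!! Stay safe: https://t.co/abc123 #Fire2020 @news"

no_links = re.sub(r"https?://\S+|www\.\S+", "", tweet)  # drop URLs
letters_only = re.sub(r"[^a-zA-Z]", " ", no_links)       # non-letters become spaces
tokens = letters_only.lower().split()                    # lowercase, split on whitespace

print(tokens)
# ['huge', 'fire', 'downtown', 'stay', 'safe', 'fire', 'news']
```

Note that the link has to be removed first. If the non-letter filter ran first, the URL would fall apart into pieces like 'https' and 'co', which is exactly what shows up in the word cloud of the raw text.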

In [None]:
df_train['processed_text'] = preprocess_text(df_train['text'])
df_test['processed_text'] = preprocess_text(df_test['text'])

### You can see how preprocessing kept only the important terms of each sentence and dropped the insignificant ones.

In [None]:
sample_df = df_train.sample(n=20).reset_index(drop=True)
for i in range(10):
    print("-"*100)
    print()
    print(f"BEFORE: {sample_df.loc[i, 'text']}")
    print()
    print(f"AFTER: {sample_df.loc[i, 'processed_text']}")
    print()

In [None]:
make_wordcloud(df_train['processed_text'])

In [None]:
make_wordcloud(df_test['processed_text'])

## Word Embedding

In [None]:
# tokenizing the processed text using word_tokenize of nltk
df_train['tokenized_text'] = df_train['processed_text'].apply(lambda x: word_tokenize(x))
df_test['tokenized_text'] = df_test['processed_text'].apply(lambda x: word_tokenize(x))

In [None]:
corpus = list(df_train['tokenized_text']) + list(df_test['tokenized_text'])

I have used gensim's ***Word2Vec*** to embed the tokens. You could also use a pretrained Word2Vec model, but for the scope of this notebook I preferred to train a new model on the provided corpus of sentences. If you want to learn more about Word2Vec, I highly recommend the article on it by [Jay Alammar](https://jalammar.github.io/illustrated-word2vec/).

In [None]:
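# passing the corpus to the constructor builds the vocabulary and already trains for gensim's default of 5 epochs
wv_model = Word2Vec(corpus, vector_size=150, window=3, min_count=2)

wv_model.train(corpus, total_examples=len(corpus), epochs=10)   # training for 10 more epochs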

In [None]:
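# averaging the vectors of each word present in a sentence;
# words missing from the Word2Vec vocabulary (and empty sentences) get a random vector,
# so the embeddings change from run to run unless NumPy's random seed is fixed

vector_list = wv_model.wv.key_to_index   # vocabulary lookup: word -> index
def word_embedding(token_list):
    if len(token_list) < 1:
        return np.random.rand(150)
    vectorized = [wv_model.wv[word] if word in vector_list else np.random.rand(150) for word in token_list]

    sum_vec = np.sum(vectorized, axis=0)
    return sum_vec / len(vectorized)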
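The function above fills in unknown words with random vectors. A different option, which is not what this notebook does, is to skip unknown words and fall back to a zero vector, so the same sentence always gets the same embedding. Here is a minimal sketch of that idea; `toy_vectors`, `DIM` and `mean_embedding` are hypothetical names, and the plain dict stands in for `wv_model.wv`.

```python
import numpy as np

DIM = 4  # hypothetical embedding size, kept small for readability

# Hypothetical lookup table standing in for wv_model.wv.
toy_vectors = {
    "fire": np.array([1.0, 0.0, 2.0, 0.0]),
    "forest": np.array([3.0, 2.0, 0.0, 0.0]),
}

def mean_embedding(tokens, vectors, dim=DIM):
    """Average the vectors of the known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

sentence_vec = mean_embedding(["forest", "fire", "unknownword"], toy_vectors)
empty_vec = mean_embedding([], toy_vectors)
print(sentence_vec)  # [2. 1. 1. 0.]
print(empty_vec)     # [0. 0. 0. 0.]
```

Whether this scores better than the random fallback is something you would have to try on the validation set.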

In [None]:
embedding_train = df_train['tokenized_text'].apply(lambda x: word_embedding(x))
embedding_test = df_test['tokenized_text'].apply(lambda x: word_embedding(x))

embedding_train = np.array([x for x in embedding_train])
embedding_test = np.array([x for x in embedding_test])

In [None]:
X_train, X_val, y_train, y_val = train_test_split(embedding_train, df_train['target'], test_size=0.2, random_state=42) # hold out 20% of the training data for validation

In [None]:
model1 = RandomForestClassifier()

model1.fit(X_train, y_train)

In [None]:
from sklearn.metrics import accuracy_score

pred = model1.predict(X_val)

print(f"The accuracy score is {accuracy_score(y_val, pred)*100}")  
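Accuracy is only one view of the result. The competition itself is scored with F1, so it can help to look at F1 and the confusion matrix as well. The labels below are invented and stand in for `y_val` and `pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Invented labels standing in for y_val and pred.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)            # F1 of the positive (disaster) class
cm = confusion_matrix(y_true, y_pred)    # rows: true class, columns: predicted class

print(acc)  # 0.75
print(f1)   # 0.75
print(cm)   # [[3 1]
            #  [1 3]]
```

In the confusion matrix the top-right cell counts false alarms and the bottom-left cell counts missed disasters.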

### The validation accuracy is about 75%, which is not bad. Some hyperparameter optimization would likely improve the performance, but as I said earlier, I tried to keep this notebook as beginner friendly as I could, so it's up to you to tweak the hyperparameters if you wish.
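If you do want to try tuning, one simple starting point is a small grid search with cross-validation. The sketch below runs on synthetic data from `make_classification` in place of the averaged embeddings, and the grid values are arbitrary picks of mine, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic features standing in for the averaged Word2Vec embeddings.
X, y = make_classification(n_samples=200, n_features=20, random_state=42)

# Hypothetical grid; the values are arbitrary examples.
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                 # 3-fold cross-validation for every combination
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the notebook's data you would pass `X_train` and `y_train` to `search.fit` and then check `search.best_estimator_` on the validation split.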

In [None]:
# retrain on the full training set (train + validation) before predicting on the test set
model = RandomForestClassifier()
model.fit(embedding_train, df_train['target'])

In [None]:
final_predict = model.predict(embedding_test)

In [None]:
df_submission = pd.DataFrame({'id':df_test['id'].values, 'target':final_predict})
df_submission.head()

In [None]:
df_submission.to_csv('submission.csv', index=False)

## Finally, if you have a good knowledge of Deep Learning, do go for the advanced approaches, but if you are a beginner, you can play with this notebook to improve the accuracy further.
## Happy Learning!! :D