![](https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/sentiments_1920x480-thumbnail-1200x1200-90.jpg)

### **Problem statement**
Sentiment analysis remains one of the key problems to which natural language processing has been applied extensively. This time around, given tweets from customers about various tech firms that manufacture and sell mobiles, computers, laptops, etc., the task is to identify whether the tweets express a negative sentiment towards such companies or products.

In [None]:
import numpy as np 
import pandas as pd 
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

In [None]:
train = pd.read_csv(r'../input/analytics-vidhya-identify-the-sentiments/train.csv')
test = pd.read_csv(r'../input/analytics-vidhya-identify-the-sentiments/test.csv')
submission = pd.read_csv(r'../input/analytics-vidhya-identify-the-sentiments/sample_submission.csv')

In [None]:
pd.set_option('display.max_colwidth',200)

In [None]:
train.shape, test.shape, submission.shape

In [None]:
train.head()

In [None]:
train.drop('id',axis=1,inplace=True)
test.drop('id',axis=1,inplace=True)

Text is a highly unstructured form of data: it contains various types of noise and is not readily analyzable without pre-processing. The entire process of cleaning and standardizing text, making it noise-free and ready for analysis, is known as text preprocessing. We will divide it into 2 parts:

- Data Inspection
- Data Cleaning

**Data Inspection**

Let’s check out a few positive tweets.

In [None]:
train[train['label']==0].head()

Let’s check out a few negative tweets.

In [None]:
train[train['label']==1].head()

There are quite a few words and characters that are not really required. So, we will try to keep only those words that are important and add value.

Let’s have a glimpse at the label distribution in the train dataset.

In [None]:
train['label'].value_counts()

In the train dataset, we have 2,026 (26%) tweets labeled as negative, and 5,894 (74%) tweets labeled as positive. So, it is an imbalanced classification challenge.
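
One common way to account for this imbalance is to weight the classes inversely to their frequency. The sketch below only uses the two label counts quoted above; the "balanced" weighting formula is an assumption on my part and is not something this notebook applies later.

```python
import numpy as np

# Label counts quoted above: 5,894 tweets with label 0 and 2,026 with label 1.
counts = np.array([5894, 2026])

# "Balanced" heuristic (an assumption, not used elsewhere in this notebook):
# weight_c = n_samples / (n_classes * count_c), so the rarer class weighs more.
weights = counts.sum() / (len(counts) * counts)
class_weight = {label: float(w) for label, w in enumerate(weights)}

for label, w in class_weight.items():
    share = counts[label] / counts.sum()
    print(f"label {label}: share={share:.1%}, weight={w:.3f}")
```

A dictionary like this could be passed to Keras through the `class_weight` argument of `model.fit`, if reweighting turns out to help.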

Now we will check the distribution of the length of the tweets, in terms of characters, in both train and test data.

In [None]:
length_train = train['tweet'].str.len()
length_test = test['tweet'].str.len()
plt.hist(length_train, bins=20,label='train_tweets')
plt.hist(length_test,bins=20,label='test_tweets')
plt.legend()
plt.show()
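
Note that `str.len()` in the cell above counts characters. If a word-level length is wanted instead, a minimal sketch could look like the following; the `tweets` series is a hypothetical stand-in for the `tweet` column.

```python
import pandas as pd

# Hypothetical stand-in for the `tweet` column of the train data.
tweets = pd.Series([
    "my phone battery died again",
    "love the new laptop",
    "",
])

char_length = tweets.str.len()               # characters, as in the cell above
word_length = tweets.str.split().str.len()   # whitespace-separated tokens

print(pd.DataFrame({"chars": char_length, "words": word_length}))
```

The word-level distribution is also a reasonable guide when choosing the padded sequence length later on.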

**Data Cleaning**

In any natural language processing task, cleaning raw text data is an important step. It helps get rid of unwanted words and characters, which in turn helps in obtaining better features. If we skip this step, there is a higher chance that we are working with noisy and inconsistent data. The objective of this step is to remove noise that is less relevant to the sentiment of the tweets, such as punctuation, special characters, numbers, and terms that don’t carry much weight in the context of the text.
Before we begin cleaning, let’s define a single cleaning function that we can apply to both the train and test datasets, so that both are preprocessed in exactly the same way.

In [None]:
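def clean_tweet(text):
    """Lower-case a raw tweet and strip handles, URLs, and non-letter characters."""

    # lower-case all characters
    text = text.lower()

    # remove twitter handles
    text = re.sub(r'@\S+', '', text)

    # remove urls (http/https links and pic.twitter.com-style links);
    # the dot is escaped so that words such as "picture" are not removed
    text = re.sub(r'http\S+', '', text)
    text = re.sub(r'pic\.\S+', '', text)

    # replace everything except letters, '+' and apostrophes with a space
    text = re.sub(r"[^a-zA-Z+']", ' ', text)

    # collapse repeated whitespace; strip removes leading and trailing spaces
    text = re.sub(r"\s+", " ", text).strip()

    return text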

In [None]:
train['tweet'] = train['tweet'].apply(clean_tweet)
test['tweet'] = test['tweet'].apply(clean_tweet)
train.head()
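
To make the effect of the cleaning steps concrete, here is a small self-contained sketch. `clean_tweet_sketch` is a hypothetical restatement of the function above (with the dot in the `pic.` pattern escaped), and the two sample tweets are invented for illustration.

```python
import re

def clean_tweet_sketch(text):
    """Hypothetical restatement of the cleaning steps, for illustration only."""
    text = text.lower()
    text = re.sub(r'@\S+', '', text)           # twitter handles
    text = re.sub(r'http\S+', '', text)        # http/https links
    text = re.sub(r'pic\.\S+', '', text)       # pic.twitter.com-style links
    text = re.sub(r"[^a-zA-Z+']", ' ', text)   # keep letters, '+' and apostrophes
    return re.sub(r'\s+', ' ', text).strip()   # collapse whitespace

# Invented sample tweets, not taken from the dataset.
samples = [
    "@Apple my #iPhone battery died AGAIN!! http://t.co/abc123 pic.twitter.com/xyz",
    "Can't wait for the iPhone 6+ :)",
]
cleaned = [clean_tweet_sketch(s) for s in samples]
for raw, out in zip(samples, cleaned):
    print(f"{raw!r} -> {out!r}")
```

Handles, links, digits, and punctuation disappear, while apostrophes and the `+` sign survive.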

In [None]:
X = train.drop('label',axis=1)
y = train['label']

In [None]:
### Vocabulary size
voc_size=10000

In [None]:
messages=X.copy()
messages['tweet'][1]

In [None]:
messages.reset_index(inplace=True)

In [None]:
### Dataset Preprocessing
# Note: no stemming or stop-word removal is applied here; each tweet is
# only reduced to lower-case alphabetic tokens separated by single spaces.
corpus = []
for i in range(len(messages)):
    review = re.sub('[^a-zA-Z]', ' ', messages['tweet'][i])
    review = review.lower()
    review = review.split()

    review = ' '.join(review)
    corpus.append(review)

In [None]:
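from tensorflow.keras.preprocessing.text import one_hot
# Despite its name, one_hot hashes each word to an integer index below voc_size;
# it does not build one-hot vectors, and different words may share an index.
onehot_repr = [one_hot(words, voc_size) for words in corpus]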
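
The encoding step above relies on the hashing trick: every word is mapped to an integer below `voc_size` without building a vocabulary first. The sketch below illustrates the idea in plain Python. `hashed_indices` is a hypothetical helper, and the choice of md5 is an assumption made so that the result is reproducible; it is not claimed to match the indices Keras produces.

```python
from hashlib import md5

def hashed_indices(text, n):
    """Map each whitespace-separated word to an index in [1, n - 1].

    Hypothetical helper illustrating the hashing trick; index 0 is left
    free so that it can serve as the padding value.
    """
    return [
        int(md5(word.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
        for word in text.lower().split()
    ]

voc_size = 10000
encoded = hashed_indices("my iphone battery died my iphone", voc_size)
print(encoded)
```

Repeated words always receive the same index, but two different words can collide, and collisions become more likely as `voc_size` shrinks.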

In [None]:
sent_length=20
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

In [None]:
embedded_docs[0]
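
With `padding='pre'`, short sequences are left-padded with zeros up to `sent_length`; longer ones are truncated, and by default Keras drops tokens from the front. The NumPy sketch below mimics that behaviour with a hypothetical helper `pad_pre`; it is meant as an illustration, not as a replacement for `pad_sequences`.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros and keep only the last `maxlen` tokens of each sequence."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = list(seq)[-maxlen:]          # truncate from the front
        if trimmed:                            # guard against empty sequences
            out[row, -len(trimmed):] = trimmed
    return out

padded = pad_pre([[7, 3], [1, 2, 3, 4, 5], []], maxlen=4)
print(padded)
```

With `sent_length=20`, any tweet longer than 20 tokens therefore loses its opening words rather than its ending.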

In [None]:
## Creating model
embedding_vector_features=40
model=Sequential()
model.add(Embedding(voc_size,embedding_vector_features,input_length=sent_length))
model.add(Dropout(0.3))
model.add(LSTM(100))
model.add(Dropout(0.3))
model.add(Dense(1,activation='sigmoid'))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()  # prints the table itself; wrapping it in print() adds a stray "None"
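
As a sanity check on the model summary, the number of trainable parameters can be worked out by hand from the layer sizes used above. This is a plain arithmetic sketch that assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias.

```python
voc_size = 10000
embedding_dim = 40
lstm_units = 100

# One embedding vector per vocabulary index.
embedding_params = voc_size * embedding_dim

# Four gates, each with input weights, recurrent weights, and a bias vector.
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Single sigmoid output unit: one weight per LSTM unit plus a bias.
dense_params = lstm_units * 1 + 1

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

The dropout layers add no parameters, so most of the model's capacity sits in the embedding table.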

In [None]:
len(embedded_docs),y.shape

In [None]:
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=42)
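
Since the labels are imbalanced, it may be worth keeping the class mix identical in both splits. The sketch below shows the `stratify` argument of `train_test_split` on hypothetical stand-in arrays (`X_demo`, `y_demo`); the split above does not stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with roughly the 74/26 label mix described earlier.
rng = np.random.default_rng(42)
X_demo = rng.integers(1, 10000, size=(1000, 20))
y_demo = np.array([0] * 740 + [1] * 260)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

print(f"negative share: train={y_tr.mean():.3f}, validation={y_val.mean():.3f}")
```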

#### **Model Training**

In [None]:
### Finally Training
history = model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=15,batch_size=128)

#### **Performance Metrics And Accuracy**

In [None]:
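# predict_classes was removed from recent TensorFlow releases;
# threshold the sigmoid output at 0.5 instead
y_pred = (model.predict(X_test) > 0.5).astype('int32').ravel()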

In [None]:
from sklearn.metrics import confusion_matrix
print(f'confusion_matrix: {confusion_matrix(y_test,y_pred)}')

In [None]:
from sklearn.metrics import accuracy_score
print(f'accuracy_score: {accuracy_score(y_test,y_pred)}')
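
Because roughly 74% of the tweets are positive, a model that always predicts label 0 would already reach about 74% accuracy. Precision, recall, and F1 for the negative class can give a clearer picture. The sketch below computes them by hand on invented labels; `y_true` and `y_hat` are hypothetical and unrelated to the model's actual predictions.

```python
import numpy as np

# Invented labels and predictions (1 = negative tweet, 0 = positive tweet).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_hat == 1)))  # negatives correctly flagged
fp = int(np.sum((y_true == 0) & (y_hat == 1)))  # positives flagged as negative
fn = int(np.sum((y_true == 1) & (y_hat == 0)))  # negatives that were missed

accuracy = float(np.mean(y_true == y_hat))
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In this toy case the accuracy is 0.7 even though half of the negative tweets are missed, which is the kind of gap these extra metrics are meant to expose.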

In [None]:
# saving model
filename = 'nlp_model.h5'
model.save(filename)

#### **Predict on test**

In [None]:
Z = test
### Vocabulary size
voc_size=10000

messages=Z.copy()
messages['tweet'][1]

In [None]:
messages.reset_index(inplace=True)

In [None]:
### Dataset Preprocessing
# Same steps as for the train data: no stemming or stop-word removal,
# only lower-case alphabetic tokens separated by single spaces.
corpus = []
for i in range(len(messages)):
    review = re.sub('[^a-zA-Z]', ' ', messages['tweet'][i])
    review = review.lower()
    review = review.split()

    review = ' '.join(review)
    corpus.append(review)

In [None]:
from tensorflow.keras.preprocessing.text import one_hot
onehot_repr=[one_hot(words,voc_size)for words in corpus]

In [None]:
sent_length=20
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

In [None]:
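# predict_classes was removed from recent TensorFlow releases;
# threshold the sigmoid output at 0.5 instead
pred = (model.predict(embedded_docs) > 0.5).astype('int32').ravel()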

In [None]:
submission['label'] = pred
submission.to_csv('submission.csv', index=False)

In [None]:
submission

#### **If you like this notebook, please upvote it.**
#### **Thank you!**