# SENTIMENT ANALYSIS 

In this hands-on project, we will train a Naive Bayes classifier and a logistic regression model to predict sentiment from thousands of tweets. Any company with a social media presence could use this approach to automatically predict customer sentiment (i.e., whether its customers are happy or not), without having humans manually review thousands of tweets and customer reviews.

**Now we will perform Twitter sentiment analysis**
* Perform exploratory data analysis and plot a word cloud
* Apply Python libraries to import and visualize the dataset
* Evaluate the performance of the trained Naïve Bayes classifier using confusion matrices
* Train Naïve Bayes classifier models using Scikit-Learn to perform classification
* Understand the difference between prior probability, posterior probability and likelihood
* Understand the theory and intuition behind Naïve Bayes classifiers
* Perform tokenization on tweet text using Scikit-Learn
* Understand the concept of count vectorization (tokenization)
* Perform text data cleaning such as removing punctuation and stop words

# UNDERSTAND THE PROBLEM STATEMENT AND BUSINESS CASE

Twitter is one of the platforms people use most widely to express their opinions and sentiments on various occasions. Sentiment analysis is an approach to analyzing data and retrieving the sentiment it embodies.
Tweets are very short, which creates its own set of problems, such as the use of slang and abbreviations. This article reports on exploring and preprocessing the data, transforming it into a suitable input format, and classifying tweets as positive (non-racist) or negative (racist) by building supervised learning models with Python and the NLTK library.

# IMPORT LIBRARIES AND DATASETS

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt

In [None]:
twitter_df = pd.read_csv('/kaggle/input/twitter-sentiment-analysis-hatred-speech/train.csv')
twitter_test = pd.read_csv('/kaggle/input/twitter-sentiment-analysis-hatred-speech/test.csv')

# EXPLORE DATASET

In [None]:
twitter_df.head()

In [None]:
twitter_df.describe()

In [None]:
twitter_df.info()

In [None]:
twitter_df = twitter_df.drop(['id'],axis = 1)
twitter_test = twitter_test.drop(['id'],axis = 1)

In [None]:
twitter_df

In [None]:
twitter_test

In [None]:
sns.heatmap(twitter_df.isnull(),yticklabels = False,cbar = False , cmap = "Blues")

The heatmap shows no null values in the dataset.

In [None]:
twitter_df.hist(bins = 40,figsize = (14,5),color = 'r')

In [None]:
# Count of tweets per class (0 = positive, 1 = negative)
sns.countplot(x = 'label', data = twitter_df)

In [None]:
twitter_df['length'] = twitter_df['tweet'].apply(len)

In [None]:
twitter_df

In [None]:
twitter_df['length'].plot(bins = 100,kind = 'hist')

In [None]:
twitter_df.describe()

In [None]:
twitter_df[twitter_df['length']==11]['tweet'].iloc[0]

In [None]:
twitter_df[twitter_df['length']==84]['tweet'].iloc[0]

In [None]:
positive = twitter_df[twitter_df['label']==0]

In [None]:
negative = twitter_df[twitter_df['label']==1]

In [None]:
positive

In [None]:
negative

In [None]:
sentences = twitter_df['tweet'].tolist()

In [None]:
sentences

In [None]:
len(sentences)

In [None]:
sentences_as_one_string = " ".join(sentences)

In [None]:
!pip install Wordcloud

# PLOT THE WORDCLOUD

In [None]:
from wordcloud import WordCloud

In [None]:
import matplotlib.pyplot as plt

# Render the word cloud built from all tweets on a large figure
plt.figure(figsize = (20, 20))
plt.imshow(WordCloud().generate(sentences_as_one_string))

# PERFORM DATA CLEANING - REMOVE PUNCTUATION FROM TEXT

In [None]:
import string
string.punctuation

In [None]:
Test = 'Good morning beautiful people :)... I am having fun learning Machine learning and AI!!'

In [None]:
Test_punc_rem = [char for char in Test if char not in string.punctuation]

In [None]:
Test_punc_rem

In [None]:
Test_punc_rem_join = ''.join(Test_punc_rem)
Test_punc_rem_join
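The character-by-character filter above can also be written with `str.translate`, which is usually faster on long strings. This is a minimal sketch, not part of the original notebook; the names `PUNCT_TABLE` and `remove_punctuation` are made up for illustration.

```python
import string

# Translation table that deletes every character listed in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(text):
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

sample = 'Good morning beautiful people :)... I am having fun learning Machine learning and AI!!'
cleaned = remove_punctuation(sample)
print(cleaned)
```

Both approaches only remove ASCII punctuation, so characters such as curly quotes or emoji are left untouched.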

# PERFORM DATA CLEANING - REMOVE STOPWORDS

In [None]:
import nltk # Natural Language tool kit 

nltk.download('stopwords')

In [None]:
import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer 

In [None]:
stopwords.words('english')

In [None]:
Test_punc_rem_join_clean = [word for word in Test_punc_rem_join.split() if word.lower() not in stopwords.words('english')]

In [None]:
Test_punc_rem_join_clean 
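The list comprehension above calls `stopwords.words('english')` once per word, which rebuilds the list each time. A common alternative is to build a set once and reuse it. The sketch below uses a tiny, hypothetical stop-word set (`STOP_WORDS`) so that it runs without the NLTK data; in the notebook you would presumably use `set(stopwords.words('english'))` instead.

```python
# Hypothetical, tiny stop-word set used only so this example runs without NLTK data.
STOP_WORDS = {"i", "am", "and", "the", "is", "a", "having"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Split on whitespace and drop tokens whose lowercase form is a stop word."""
    return [word for word in text.split() if word.lower() not in stop_words]

tokens = remove_stop_words("I am having fun learning Machine learning and AI")
print(tokens)
```

Set membership checks take roughly constant time, so this matters once the function is applied to every tweet in the dataset.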

# PERFORM COUNT VECTORIZATION (TOKENIZATION)

In [None]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
sample_data = ['This is the first paper.','This document is the second paper.','And this is the third one.','Is this the first paper?']

In [None]:
vectorizer = CountVectorizer()

In [None]:
X = vectorizer.fit_transform(sample_data)

In [None]:
print(X.toarray())
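To see what the matrix printed above represents, here is a rough pure-Python sketch of count vectorization. It is an approximation of what `CountVectorizer` does with its default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary), not the library's actual implementation; `count_vectorize` is a made-up helper name.

```python
import re
from collections import Counter

def count_vectorize(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    token_re = re.compile(r"\b\w\w+\b")  # tokens of 2+ word characters
    tokenized = [token_re.findall(doc.lower()) for doc in docs]
    # One column per distinct token, in alphabetical order
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ['This is the first paper.', 'This document is the second paper.',
        'And this is the third one.', 'Is this the first paper?']
vocabulary, matrix = count_vectorize(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, so the first and last sentences, which contain the same words, produce identical rows.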

# CREATE A PIPELINE TO REMOVE PUNCTUATIONS, STOPWORDS AND PERFORM COUNT VECTORIZATION

In [None]:
def message_cleaning(message):
    # Remove punctuation characters, then rejoin into a single string
    Test_punc_rem = [char for char in message if char not in string.punctuation]
    Test_punc_rem_join = ''.join(Test_punc_rem)
    # Drop English stop words (case-insensitive comparison)
    Test_punc_rem_join_clean = [word for word in Test_punc_rem_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_rem_join_clean

In [None]:
# Let's test the newly added function
twitter_df_clean = twitter_df['tweet'].apply(message_cleaning)

In [None]:
print(twitter_df_clean[5]) 

In [None]:
print(twitter_df['tweet'][5])

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Use the cleaning function defined earlier as the analyzer
vectorizer = CountVectorizer(analyzer = message_cleaning, dtype = 'uint8')
twitter_countvectorizer = vectorizer.fit_transform(twitter_df['tweet']).toarray()

In [None]:
twitter_countvectorizer.shape

In [None]:
x = twitter_countvectorizer
y = twitter_df['label']

In [None]:
x.shape

In [None]:
y.shape

# UNDERSTAND THE THEORY AND INTUITION BEHIND NAIVE BAYES

Bayes’ Theorem is a formula that tells us how to update the probability of a hypothesis when a given event occurs. In other words, it gives the probability of a hypothesis given an event. The image shown above gives a solid summary of Bayes’ formula and each of its components. I am not going to dive deep into Bayes’ theorem, as I want to focus on Naive Bayes, but Data Skeptic did a great mini-podcast that explains the intuition behind Bayesian updating. I recommend checking it out before moving on to get a better grasp of the concept.
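As a small worked example of Bayesian updating, the sketch below computes a posterior from a prior and two likelihoods using P(H|E) = P(E|H)·P(H) / P(E). All numbers are invented for illustration and are not taken from the tweet dataset; `bayes_posterior` is a hypothetical helper name.

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """Posterior P(H|E) from the prior P(H), P(E|H) and P(E|not H)."""
    # Total probability of the evidence E over both hypotheses
    evidence = likelihood * prior + likelihood_given_not * (1.0 - prior)
    return likelihood * prior / evidence

# Made-up numbers: 10% of tweets are negative (prior); a certain word appears
# in 50% of negative tweets and in 5% of positive tweets (likelihoods).
posterior = bayes_posterior(prior=0.10, likelihood=0.50, likelihood_given_not=0.05)
print(f"P(negative | word) = {posterior:.3f}")
```

Seeing the word raises the probability of the negative class from the 10% prior to a bit over 50%, which is the "update" the theorem describes.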


## Naive Bayes Classifier


Now that we have an understanding of the Bayesian framework, we can move to Naive Bayes. Naive Bayes is a classification algorithm used for binary or multi-class classification. Classification is carried out by calculating the posterior probability of each class and picking the class with the highest posterior, known as the maximum a posteriori (MAP) estimate. In other words, it finds the probability of each label given the observed features and assigns the label with the highest probability. It is called naive because it assumes all features are independent of each other given the class, which is rarely the case in real life.
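The MAP decision can be sketched in a few lines. The example below is an illustration under stated assumptions, not the scikit-learn implementation: the priors and per-word likelihoods are invented, every token is assumed to appear in the likelihood table, and `map_predict` is a hypothetical name. Log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on long documents.

```python
import math

def map_predict(tokens, log_priors, log_likelihoods):
    """Return (best_label, scores) where scores are unnormalized log posteriors."""
    scores = {}
    for label, log_prior in log_priors.items():
        # Naive assumption: tokens are independent given the class,
        # so their log likelihoods simply add up.
        scores[label] = log_prior + sum(log_likelihoods[label][tok] for tok in tokens)
    return max(scores, key=scores.get), scores

# Invented probabilities, for illustration only
log_priors = {"positive": math.log(0.9), "negative": math.log(0.1)}
log_likelihoods = {
    "positive": {"love": math.log(0.05), "hate": math.log(0.005)},
    "negative": {"love": math.log(0.01), "hate": math.log(0.08)},
}

label_a, _ = map_predict(["love"], log_priors, log_likelihoods)
label_b, _ = map_predict(["hate", "hate"], log_priors, log_likelihoods)
print(label_a, label_b)
```

Even though the prior strongly favors the positive class, two occurrences of a word that is much more likely under the negative class are enough to flip the decision.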

## Things to Remember
* Easy to understand and fast to implement
* Needs less training data than logistic regression
* Performs well for categorical input values
* Suffers from the “zero frequency” problem: if a categorical variable has a category in the test set that is not present in the training set, the model assigns it a probability of 0, and because probabilities are multiplied together this zeroes out the whole prediction. This can be fixed with a smoothing method such as Laplace estimation, which assigns a small non-zero probability to values not seen in the training set. This is extremely relevant for text classification: if one word does not appear in the training set, you do not want the classifier to lower the probability of the entire document to 0.
* Assumption of independent predictors
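The Laplace (add-one) smoothing mentioned in the list can be sketched as follows. This is a simplified illustration with a made-up three-word vocabulary, assuming every training token belongs to the vocabulary; `smoothed_likelihoods` is a hypothetical helper. scikit-learn's `MultinomialNB` exposes the same idea through its `alpha` parameter, which defaults to 1.0.

```python
from collections import Counter

def smoothed_likelihoods(class_tokens, vocabulary, alpha=1.0):
    """Estimate P(word | class) with additive smoothing.

    P(w | c) = (count(w, c) + alpha) / (total_tokens_in_c + alpha * |V|)
    """
    counts = Counter(class_tokens)
    denominator = len(class_tokens) + alpha * len(vocabulary)
    return {word: (counts[word] + alpha) / denominator for word in vocabulary}

vocabulary = ["bad", "good", "great"]
positive_tokens = ["good", "great", "good"]  # "bad" never appears in this class

probs = smoothed_likelihoods(positive_tokens, vocabulary)
print(probs)
```

The unseen word "bad" receives a small non-zero probability instead of 0, and the probabilities over the vocabulary still sum to 1.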

# Naive Bayes vs Logistic Regression
Naive Bayes is often compared to another classification algorithm, Logistic Regression. Logistic Regression is a linear classification model that learns the probability of a sample belonging to a class and tries to find the optimal decision boundary that separates the classes.
The main difference between the two is that Naive Bayes is a generative model and Logistic Regression is a discriminative model. A generative model tries to recreate the process that generated the data by estimating its assumptions and distributions, and then uses this to predict unseen data. For example, Naive Bayes models the joint probability of the features X and the label Y and derives the posterior probability from that model. A discriminative model is built only on the observed data and makes fewer assumptions about the distribution of the data. However, it is very reliant on the quality of the data. For example, Logistic Regression directly models the posterior probability by learning the input-to-output mapping that minimizes the error.
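To make the discriminative side concrete, the sketch below shows how a fitted logistic regression turns a feature vector into a posterior probability: a weighted sum of the features passed through the sigmoid function. The weights, bias and feature counts are invented for illustration; they are not the values learned in this notebook.

```python
import math

def sigmoid(z):
    """Map any real number to the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, bias):
    """P(y = 1 | x) for a logistic regression with the given parameters."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(z)

# Invented parameters: feature 0 pushes towards class 1, feature 1 towards class 0
weights = [2.0, -1.5]
bias = -0.5

p_neutral = predict_proba([0, 0], [1.0, -1.0], 0.0)  # z = 0
p_class1 = predict_proba([2, 0], weights, bias)      # z = 3.5
p_class0 = predict_proba([0, 2], weights, bias)      # z = -3.5
print(p_neutral, p_class1, p_class0)
```

The decision boundary is the set of points where the weighted sum is zero, that is, where the predicted probability is exactly 0.5.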

# Compare and Contrast
* Naive Bayes assumes all features are independent, so if variables are correlated its predictions can suffer. Logistic Regression is better at handling correlation.
* Naive Bayes can work well on small training samples with high dimensionality (provided the independence assumption roughly holds), because its strong assumptions leave fewer parameters to estimate. This is why it is commonly used for text classification.
* Logistic Regression tends to work better than Naive Bayes on large data sets.

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [None]:
from sklearn.naive_bayes import MultinomialNB

NB_classifier = MultinomialNB()
NB_classifier.fit(x_train, y_train)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:

y_predict_test = NB_classifier.predict(x_test)
cm = confusion_matrix(y_test,y_predict_test)



In [None]:
# fmt = 'd' prints the counts as integers instead of scientific notation
sns.heatmap(cm, annot = True, fmt = 'd')

In [None]:
print(classification_report(y_test, y_predict_test))
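The figures in the classification report can be derived by hand from the confusion matrix. The sketch below does this for the positive class (label 1) of a binary problem, assuming scikit-learn's layout where rows are true labels and columns are predicted labels. The matrix values are invented for illustration and are not results from this notebook; `binary_metrics` is a hypothetical helper.

```python
def binary_metrics(cm):
    """Precision, recall, F1 (for label 1) and accuracy from a 2x2 confusion matrix."""
    (tn, fp), (fn, tp) = cm  # rows: true 0 / true 1, columns: predicted 0 / predicted 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Invented counts, for illustration only
example_cm = [[90, 10],
              [5, 15]]
metrics = binary_metrics(example_cm)
print(metrics)
```

With imbalanced classes, as is common in hate-speech data, accuracy can look high while precision and recall on the minority class remain modest, so it is worth reading all four numbers.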

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Re-split the data with a larger (35%) test set for the logistic regression model
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.35)

In [None]:
# Fitting Logistic Regression
log_reg_model = LogisticRegression()
log_reg_model.fit(x_train, y_train)

In [None]:
# Scoring
train_prediction = log_reg_model.predict(x_train)
test_prediction = log_reg_model.predict(x_test)
# accuracy_score expects the true labels first, then the predictions
accuracy_train = accuracy_score(y_train, train_prediction)
accuracy_test = accuracy_score(y_test, test_prediction)



In [None]:
print(f"Score on training set: {accuracy_train}")
print(f"Score on test set: {accuracy_test}")

Here I got almost the same result with both models. There are many more techniques that could be applied; this model is for learning purposes.