# **Tweets Sentiment Analysis**

![](https://miro.medium.com/max/1000/1*vp1M37AGMOFwCvLxVm62IA.jpeg)

# IMPORTING THE LIBRARIES

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import scipy as sp
import string
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

# LOADING THE DATASET

In [None]:
data = pd.read_csv("../input/tweets-sentiment-analysis/train.csv", encoding='ISO-8859-1')

In [None]:
data

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.value_counts()

In [None]:
data.dtypes

In [None]:
data.shape

In [None]:
data.columns

In [None]:
data.describe().transpose()


In [None]:
data.var(numeric_only=True)  # variance is only defined for the numeric columns

# **Checking Null Values**

In [None]:
data.isnull().sum()


In [None]:
data.isnull().any()


# **Exploratory Data Analysis**

In [None]:
data.corr(numeric_only=True)  # skip the text column, which has no correlation

**HEATMAP**

**A heatmap is a two-dimensional graphical representation of data that uses color to encode values. It lets a viewer take in statistical or data-driven information quickly; here it shows the pairwise correlations between the numeric columns.**

In [None]:
plt.figure(figsize = (16,10))

# numeric_only=True keeps the text column out of the correlation matrix
sns.heatmap(data.corr(numeric_only=True), annot =True)
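
Each cell of the heatmap above is a Pearson correlation coefficient, which is the pandas default for `corr()`. As a rough sketch of what that number means, the coefficient can be computed by hand for two small lists. The helper name `pearson` and the sample values are assumptions for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance and variances, all left unnormalized because the 1/n factors cancel.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: the second list is exactly twice the first.
print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))
```

A value near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates none.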


**HISTPLOT**

**Histograms represent the data distribution by forming bins along the range of the data and then drawing bars to show the number of observations that fall in each bin.**
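
The binning step can be sketched in plain Python. This is a minimal illustration assuming equal-width bins and values that are not all identical; `histogram_counts` and the sample list are hypothetical names and data, not part of the notebook.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes hi > lo
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin instead of one past the end.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical sample values, not taken from the tweets dataset.
sample = [0, 1, 1, 2, 3, 5, 8, 9, 10]
print(histogram_counts(sample, 5))  # [3, 2, 1, 0, 3]
```

The bar heights drawn by a histogram are these counts, one per bin.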


In [None]:
data.hist(figsize=(18,12))
plt.show()


**BARPLOT**

**A barplot (or bar chart) is one of the most common types of chart. It shows the relationship between a numeric and a categorical variable. Each category is represented as a bar, and the size of the bar represents its numeric value.**


In [None]:
plt.style.use("default")
sns.barplot(x="ItemID", y="SentimentText",data=data[180:190])
plt.title("ItemID vs SentimentText",fontsize=15)
plt.xlabel("ItemID")
plt.ylabel("SentimentText")
plt.show()

**PAIRPLOT**

**pairplot(): To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. It shows the relationship for every pair of numeric variables in a DataFrame as a matrix of plots, and the diagonal plots are the univariate distributions.**

In [None]:
sns.set_palette("Paired")
sns.pairplot(data,hue='Sentiment',height=5.5,palette='colorblind')
plt.show()


In [None]:
data.columns

**BOXPLOT**

**A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”). ... It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.**
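
As a small sketch of the five-number summary, the values can be computed with the standard library. The function name and the sample data are assumptions for illustration. Note that seaborn's whiskers extend to 1.5 times the interquartile range by default, so the plotted "minimum" and "maximum" can differ from the raw extremes when outliers are present.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" treats the data as the whole population of interest.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]

# Hypothetical sample values, not taken from the tweets dataset.
print(five_number_summary([9, 1, 8, 2, 7, 3, 6, 4, 5]))
```

The box spans Q1 to Q3, and the line inside it marks the median.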

In [None]:

plt.figure(figsize=(14,10))
sns.set_style(style='darkgrid')
plt.subplot(2,3,1)
sns.boxplot(x='Sentiment',data=data)
plt.subplot(2,3,2)
sns.boxplot(x='ItemID',data=data)


**KDE PLOT**

**A KDE plot (kernel density estimate plot) depicts an estimate of the probability density function of a continuous variable, obtained with a non-parametric method. It can be drawn for a single variable or for several variables together, and Seaborn's kdeplot() offers various options for customizing it.**
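
To make the idea concrete, here is a minimal sketch of a Gaussian kernel density estimate written by hand. It assumes a fixed, manually chosen bandwidth, whereas seaborn picks one automatically; `gaussian_kde_sketch` and the sample labels are hypothetical.

```python
import math

def gaussian_kde_sketch(samples, bandwidth):
    """Return a function estimating the density as an average of Gaussian bumps."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # One Gaussian kernel centered on every observed sample.
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

# Hypothetical binary labels: two zeros and three ones.
density = gaussian_kde_sketch([0, 0, 1, 1, 1], bandwidth=0.2)
print(density(0.0), density(1.0))
```

Because the labels are binary, the estimate has one bump near 0 and a taller one near 1, which is the general shape to expect from a KDE of a 0/1 column.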


In [None]:
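plt.style.use("ggplot")
plt.figure(figsize=(12,8))
# fill=True replaces the deprecated shade=True in recent seaborn releases
sns.kdeplot(data['Sentiment'],fill=True,color='blue')
# label the axes after plotting so seaborn does not overwrite them
plt.xlabel('Sentiment')
plt.ylabel('Density')
plt.show()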


# NLTK

In [None]:
import nltk
import scikitplot as skplt
import re
from nltk.corpus import stopwords
nltk.download('stopwords')
STOPWORDS = stopwords.words('english')


In [None]:
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^0-9a-zA-Z]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = " ".join(word for word in text.split() if word not in STOPWORDS)
    return text


In [None]:
data['clean_text'] = data['SentimentText'].apply(clean_text)
data.head()
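
To see what the cleaning step does to a single tweet, the sketch below repeats the same three operations on a made-up sentence. It uses a tiny hypothetical stop-word set so that it runs without downloading NLTK data; the notebook itself uses NLTK's full English list.

```python
import re

# Hypothetical stop-word subset, only for this demonstration.
DEMO_STOPWORDS = {"i", "this", "is", "the", "a"}

def clean_text_demo(text, stop_words=DEMO_STOPWORDS):
    text = text.lower()                     # normalize case
    text = re.sub(r'[^0-9a-z]', ' ', text)  # replace punctuation and symbols with spaces
    text = re.sub(r'\s+', ' ', text)        # collapse repeated whitespace
    return " ".join(word for word in text.split() if word not in stop_words)

print(clean_text_demo("I LOVE this movie!!! It is the BEST :)"))
# love movie it best
```

Punctuation and emoticons disappear, and only the words outside the stop-word set survive.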


In [None]:
X = data['clean_text']
y = data['Sentiment']


In [None]:
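from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
nltk.download('punkt')  # tokenizer data for word_tokenize; newer NLTK releases may need 'punkt_tab' instead
ps=PorterStemmer()  # instantiate the stemmer rather than referencing the class
# tokenize and stem the first cleaned tweet as a quick check
words=word_tokenize(data['clean_text'].iloc[0])
stemmed_words=[ps.stem(word) for word in words]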


In [None]:
#importing the CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
#lemmatizer=WordNetLemmatizer()


In [None]:
#define a function to get rid of punctuation and stopwords present in the messages
def message_text_process(mess):
    # Keep only the characters that are not punctuation
    no_punctuation=[char for char in mess if char not in string.punctuation]
    # now form the sentence
    no_punctuation=''.join(no_punctuation)
    # Now eliminate any stopwords (STOPWORDS was loaded once above, so the list is not rebuilt for every word)
    return [word for word in no_punctuation.split() if word.lower() not in STOPWORDS]


In [None]:
data['SentimentText'].head(5).apply(message_text_process)


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
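
`CountVectorizer` followed by `TfidfTransformer` turns each text into a vector of weighted word counts. The sketch below imitates that in plain Python under stated assumptions: whitespace tokenization (simpler than scikit-learn's default token pattern), the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The function name `tfidf` and the example documents are hypothetical.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where each row is an L2-normalized TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # guard against empty documents
        rows.append([v / norm for v in row])
    return vocab, rows

vocab, rows = tfidf(["good movie", "bad movie", "good good day"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that appear in fewer documents receive a larger weight, which is why "bad" outweighs "movie" in the second document.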

In [None]:
def classify(model, X, y):
    # train test split
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, shuffle=True, stratify=y)
    # model training
    pipeline_model = Pipeline([('vect', CountVectorizer()),
                              ('tfidf', TfidfTransformer()),
                              ('clf', model)])
    pipeline_model.fit(x_train, y_train)
    
    print('Test Accuracy:', pipeline_model.score(x_test, y_test)*100)
    
    print("Training Accuracy:", pipeline_model.score(x_train,y_train)*100)


    y_pred = pipeline_model.predict(x_test)
    y_probas =pipeline_model.predict_proba(x_test)
    skplt.metrics.plot_roc(y_test,y_probas,figsize=(10,6),title_fontsize=14,text_fontsize=12)
    plt.show()
    skplt.metrics.plot_precision_recall(y_test,y_probas,figsize=(10,6),title_fontsize=14,text_fontsize=12)
    plt.show()
    skplt.estimators.plot_learning_curve(pipeline_model, X,y,figsize=(10,6),title_fontsize=14,text_fontsize=12)
    plt.show()
    skplt.metrics.plot_lift_curve(y_test,y_probas,figsize=(10,6),title_fontsize=14,text_fontsize=12)
    plt.show()
    skplt.metrics.plot_confusion_matrix(y_test,y_pred,figsize=(10,6),title_fontsize=14,text_fontsize=12,cmap=plt.cm.Pastel1)
    plt.show()
    print(classification_report(y_test, y_pred))
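
The classification report printed at the end of `classify` lists precision, recall and F1 per class. As a rough sketch of where those numbers come from, they can be derived from the four confusion-matrix counts for the positive class. The helper `binary_metrics` and the label lists are hypothetical and assume labels of 0 and 1.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical labels and predictions, not results from the notebook.
print(binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0]))
```

Comparing these values between the training and test splits is a quick way to spot overfitting.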


# **MODEL BUILDING**

**LOGISTIC REGRESSION**

**Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) estimates the parameters of a logistic model (a form of binary regression).**
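
The logistic function itself is short enough to sketch directly. The example below shows how a linear score is squashed into a probability; the weights, bias and feature values are made up for illustration and are not fitted parameters.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample under a logistic model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical parameters: a positive weight pushes the probability above 0.5.
print(predict_proba([1.5, -0.5], 0.0, [2.0, 1.0]))
```

Training consists of choosing the weights and bias that make these probabilities match the observed labels as closely as possible.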


In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
classify(model, X, y)


**Decision Tree Classifier**

**Decision Trees are a non-parametric supervised learning method used for both classification and regression tasks. ... Tree models where the target variable can take a discrete set of values are called classification trees.**
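
A classification tree chooses each split by how much it reduces impurity; Gini impurity is the default criterion of scikit-learn's `DecisionTreeClassifier`. The sketch below computes it for small hypothetical label lists; the function names are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, larger when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(left, right):
    """Impurity of a split, weighting each side by its share of the samples."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical labels: a mixed parent node and a split that separates it perfectly.
print(gini([0, 0, 1, 1]))        # 0.5
print(split_gini([0, 0], [1, 1]))  # 0.0
```

The tree keeps the candidate split with the lowest weighted impurity and repeats the process on each side until it reaches `max_depth`.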


In [None]:
from sklearn import tree
tree_clf = tree.DecisionTreeClassifier(max_depth = 2)
classify(tree_clf,X,y)
tree.plot_tree(tree_clf)


**NAIVE BAYES**

**The Naïve Bayes algorithm is a supervised learning algorithm based on Bayes' theorem and used for solving classification problems. ... The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps build fast machine learning models that can make quick predictions.**

**With Naive Bayes we can use:**

* **GaussianNB**

* **BernoulliNB**

* **MultinomialNB**
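
Of these variants, the multinomial one suits word counts. Here is a minimal pure-Python sketch of the idea, assuming whitespace tokenization and Laplace smoothing with `alpha=1.0`. The function names, documents and labels are hypothetical, and the sketch is far simpler than scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc.split()}
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    model = {}
    for label, n_class in class_docs.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_class / len(docs))
        log_lik = {t: math.log((word_counts[label][t] + alpha) / (total + alpha * len(vocab)))
                   for t in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, doc):
    """Pick the class with the highest log posterior; unseen words are ignored."""
    scores = {label: log_prior + sum(log_lik[t] for t in doc.split() if t in log_lik)
              for label, (log_prior, log_lik) in model.items()}
    return max(scores, key=scores.get)

# Hypothetical training data: 1 = positive, 0 = negative.
model = train_nb(["good great fun", "great good", "bad awful", "awful boring bad"], [1, 1, 0, 0])
print(predict_nb(model, "good fun"))
```

The "naive" part is the assumption that words are independent given the class, which is what allows the per-word log likelihoods to be summed.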




In [None]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
classify(model, X, y)



**ADA BOOST CLASSIFIER**

**AdaBoost, or Adaptive Boosting, is an ensemble boosting classifier proposed by Yoav Freund and Robert Schapire in 1996. It combines multiple weak classifiers to increase overall accuracy. ... Any machine learning algorithm can be used as the base classifier if it accepts weights on the training set.**
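
The role of the training-set weights can be sketched with one round of the classic binary AdaBoost update. This is an illustration of the textbook rule, not necessarily the exact variant scikit-learn runs internally; it assumes the weights sum to 1 and that the weighted error is strictly between 0 and 1. The function name and the sample labels are hypothetical.

```python
import math

def adaboost_round(weights, y_true, y_pred):
    """Return (alpha, new_weights) after one weak learner has made its predictions."""
    # Weighted error: total weight of the misclassified samples.
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1 - err) / err)  # the learner's say in the final vote
    # Misclassified samples gain weight, correctly classified ones lose weight.
    updated = [w * math.exp(alpha if t != p else -alpha)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

# Hypothetical round: four equally weighted samples, the last one misclassified.
alpha, new_weights = adaboost_round([0.25] * 4, [1, 1, 0, 0], [1, 1, 0, 1])
print(alpha, new_weights)
```

After the update the misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.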

In [None]:
from sklearn.ensemble import AdaBoostClassifier
model= AdaBoostClassifier()  # default base learner; the base_estimator argument was renamed to estimator in newer scikit-learn
classify(model, X, y)


**Conclusion**

**Of the algorithms run here, Naive Bayes reached a training accuracy of 88%, followed by logistic regression at 84%, which is quite good for the given dataset.**

**Thank You**