![](https://www.kdnuggets.com/images/sentiment-fig-1-689.jpg)

# Tweet Sentiment Analysis

Sentiment analysis is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques. Sentiment analysis allows businesses to identify customer sentiment toward products, brands or services in online conversations and feedback.

# Why Perform Sentiment Analysis?

It’s estimated that 80% of the world’s data is unstructured; in other words, it’s unorganized. Huge volumes of text data (emails, support tickets, chats, social media conversations, surveys, articles, documents, etc.) are created every day, but they are hard to analyze, understand, and sort through, not to mention time-consuming and expensive to process.

Sentiment analysis, however, helps businesses make sense of all this unstructured text by automatically tagging it.

"My ridiculous dog is amazing." [sentiment: positive]

With so many tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will help a company's or a person's brand by going viral (positive), or devastate profit because it strikes a negative tone. Capturing sentiment in language is important at a time when decisions and reactions are created and updated in seconds. But which words actually lead to the sentiment description? In this competition you will need to pick out the part of the tweet (word or phrase) that reflects the sentiment.

# Content:

1. Load and Check Data
2. Variable Description
3. Univariate Variable Analysis
4. Text Length Distribution
5. Selected Text Length Distribution
6. Basic Data Analysis
7. Cleaning the Data
    *     Removing square brackets, links, punctuations etc.
    *     Removing the stopwords
    *     Lemmatization
8. N-Gram Modelling
9. Most Common Words Analysis
    *     In "Selected Text"
    *     In "Text"
10. Most common words Sentiments Wise
    *     25 most common positive words
    *     25 most common negative words
    *     25 most common neutral words
11. Unique Words in each Segment
    *     Top 10 unique positive words
    *     Top 10 unique negative words
    *     Top 10 unique neutral words
12. Modeling With Jaccard Scores Over 0.2
    *     Naive Bayes
    *     Logistic Regression
    *     Decision Tree
    *     Random Forest
    *     K-Nearest Neighbour
    *     Support Vector Machine
    *     LightGBM
13. Modeling With Jaccard Scores Over 0.8
    *     Naive Bayes
    *     Logistic Regression
    *     Decision Tree
    *     Random Forest
    *     K-Nearest Neighbour
    *     Support Vector Machine
    *     LightGBM

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import os
import re
import string
import random
import warnings
from collections import Counter

import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from plotly import graph_objs as go
import plotly.express as px
import plotly.figure_factory as ff

from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import nltk
import nltk as nlp  # alias used later for the lemmatizer
from nltk.corpus import stopwords

from tqdm import tqdm

warnings.filterwarnings("ignore")

# on Matplotlib 3.6+ this style is named "seaborn-v0_8-whitegrid"
plt.style.use("seaborn-whitegrid")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load and Check Data


In [None]:
train_df = pd.read_csv("/kaggle/input/tweet-sentiment-extraction/train.csv")
test_df = pd.read_csv("/kaggle/input/tweet-sentiment-extraction/test.csv")

train_df.columns

In [None]:
train_df.head()

In [None]:
train_df.describe()

# Variable Description
1. textID: unique ID for each piece of text
2. text: the text of the tweet
3. selected_text: [train only] the text that supports the tweet's sentiment
4. sentiment: the general sentiment of the tweet; positive, negative or neutral


In [None]:
print(train_df.shape)
print(test_df.shape)

In [None]:
train_df.info()

We have one null value in the train set. Since its text field is NaN, we will simply drop that row.



In [None]:
train_df.dropna(inplace=True)

# Univariate Variable Analysis
Categorical Variable: textID, text, selected_text  , sentiment

**Categorical Variable**

Let's look at the distribution of tweets in the train set

In [None]:
temp = train_df.groupby('sentiment').count()['text'].reset_index().sort_values(by='text',ascending=False)
temp.style.background_gradient(cmap='Reds')

In [None]:
def bar_plot(variable):
   
    # get feature
    var = train_df[variable]
    # count number of categorical variable(value/sample)
    varValue = var.value_counts()
    
    # visualize
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))

In [None]:
plt.figure(figsize=(10,8))
sns.countplot(x='sentiment',data=train_df)

In [None]:
fig = go.Figure(go.Funnelarea(
    text =temp.sentiment,
    values = temp.text,
    title = {"position": "top center", "text": "Funnel-Chart of Sentiment Distribution"}
    ))
fig.show()

# Text Length Distribution

In [None]:
lens = [len(x) for x in train_df.text]
plt.figure(figsize=(12, 5));

print ("Max length:", max(lens))
print ("Min length:", min(lens))
print ("Mean length:", np.mean(lens))

sns.distplot(lens);
plt.title('Text length distribution')

# Selected Text Length Distribution

In [None]:
lens = [len(x) for x in train_df.selected_text]
plt.figure(figsize=(12, 5));
print ("Max length:", max(lens))
print ("Min length:", min(lens))
print ("Mean length:", np.mean(lens))
sns.distplot(lens);
plt.title('Selected text length distribution')

# Basic Data Analysis


In [None]:
def jaccard(str1, str2): 
    a = set(str1.lower().split()) 
    b = set(str2.lower().split())
    c = a.intersection(b)
    return float(len(c)) / (len(a) + len(b) - len(c))
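
As a quick sanity check of the word-level Jaccard score, the sketch below re-declares the same formula under a hypothetical name, `jaccard_demo`, and applies it to the example tweet from the introduction. The guard for two empty strings is an assumption added for the sketch; it is not part of the function above.

```python
def jaccard_demo(str1, str2):
    """Word-level Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a = set(str1.lower().split())
    b = set(str2.lower().split())
    if not a and not b:
        return 0.0  # assumption: two empty strings count as no overlap
    c = a & b
    return len(c) / (len(a) + len(b) - len(c))


full_tweet = "My ridiculous dog is amazing."
selected = "amazing."
score = jaccard_demo(full_tweet, selected)
print(score)  # 1 shared word out of 5 distinct words -> 0.2
```

A score of 1.0 means the selected text contains exactly the same words as the tweet, which is why neutral tweets, where almost the whole tweet is selected, score so high.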

In [None]:
results_jaccard=[]

for ind,row in train_df.iterrows():
    sentence1 = row.text
    sentence2 = row.selected_text

    jaccard_score = jaccard(sentence1,sentence2)
    results_jaccard.append([sentence1,sentence2,jaccard_score])

In [None]:
# store the scores under a new name so the jaccard() function is not overwritten
jaccard_df = pd.DataFrame(results_jaccard,columns=["text","selected_text","jaccard_score"])
train_df = train_df.merge(jaccard_df,how='outer')

In [None]:
train_df['Num_words_ST'] = train_df['selected_text'].apply(lambda x:len(str(x).split())) #Number Of words in Selected Text
train_df['Num_word_text'] = train_df['text'].apply(lambda x:len(str(x).split())) #Number Of words in main text
train_df['difference_in_words'] = train_df['Num_word_text'] - train_df['Num_words_ST'] #Difference in Number of words text and Selected Text

In [None]:
train_df.head() 

In [None]:
# Mean Jaccard score by sentiment

In [None]:
train_df.groupby('sentiment')['jaccard_score'].mean()

* For tweets labelled positive, on average about 31% of the text is recorded as selected text; that is, about 69% of the text is dropped.
* For tweets labelled negative, on average about 33% of the text is recorded as selected text; that is, about 67% of the text is dropped.
* For tweets labelled neutral, on average about 97% of the text is recorded as selected text; that is, about 3% of the text is dropped.


# Cleaning the Data
Before we dive into extracting information from the words in text and selected text, let's first clean the data.

In [None]:
def clean_text(text):
    '''Make text lowercase, remove text in square brackets, remove links, remove HTML tags,
    remove punctuation, remove newlines and remove words containing numbers.'''
    text = str(text).lower()
    # raw strings keep the regex escapes (\[, \S, \w, \d) from being read as string escapes
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text
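
To make the effect of each cleaning step concrete, here is a minimal sketch that applies the same substitutions to one made-up tweet. The function name `clean_text_demo` and the sample string are hypothetical, used only for illustration.

```python
import re
import string


def clean_text_demo(text):
    """Lowercase, then strip bracketed text, links, HTML tags,
    punctuation, newlines and words containing digits."""
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text


sample = "Check THIS out [link] https://example.com <b>wow</b>!!! 2day\n"
cleaned = clean_text_demo(sample)
print(cleaned.split())  # ['check', 'this', 'out', 'wow']
```

The removed pieces leave extra spaces behind, which is harmless here because the later steps split the text on whitespace.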

In [None]:
train_df['text'] = train_df['text'].apply(lambda x:clean_text(x))
train_df['selected_text'] = train_df['selected_text'].apply(lambda x:clean_text(x))

> Sentiment names converted to numeric

In [None]:
train_df['sentiment'] = train_df['sentiment'].map({'positive': 1, 'negative': 2, 'neutral':0})


In [None]:
train_df.head()

# **Removing the stopwords**

In [None]:
stop_words = set(stopwords.words('english'))  # build the set once instead of once per word

def remove_stopword(x):
    return [y for y in x if y not in stop_words]


In [None]:
#remove stopwords - selected text

train_df['selected_text_clear'] = train_df['selected_text'].apply(lambda x:str(x).split())

train_df['selected_text_clear'] = train_df['selected_text_clear'].apply(lambda x:remove_stopword(x))

In [None]:
#remove stopwords - text

train_df['text_clear'] = train_df['text'].apply(lambda x:str(x).split())

train_df['text_clear'] = train_df['text_clear'].apply(lambda x:remove_stopword(x))

# Lemmatization

In [None]:
lemma = nlp.WordNetLemmatizer()

def lemmatizate_word(x):
    return [lemma.lemmatize(word) for word in x]

train_df['selected_text_clear'] = train_df['selected_text_clear'].apply(lambda x:lemmatizate_word(x)) #selected text
train_df['text_clear'] = train_df['text_clear'].apply(lambda x:lemmatizate_word(x)) #text

# N-Gram Modelling

In [None]:
def ngram(text):
    # build bigrams: pairs of adjacent tokens
    return [(text[i],text[i+1]) for i in range(0,len(text)-1)]

# text_clear already holds token lists, so the bigrams are built from it directly
train_df['ngram_text'] = train_df['text_clear'].apply(ngram)
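
The n-grams built here are bigrams, that is, pairs of adjacent tokens. A minimal sketch on a hand-written token list (the name `bigrams` and the tokens are illustrative assumptions):

```python
def bigrams(tokens):
    """Return adjacent token pairs; fewer than two tokens yields []."""
    return [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]


tokens = ["love", "ridiculous", "dog"]
pairs = bigrams(tokens)
print(pairs)  # [('love', 'ridiculous'), ('ridiculous', 'dog')]
```

The function expects a list of tokens. Passing a plain string would pair up single characters instead of words.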


In [None]:
train_df.ngram_text

In [None]:
train_df.head()

# Most Common words "Selected Text"

In [None]:
top = Counter([item for sublist in train_df['selected_text_clear'] for item in sublist])
temp = pd.DataFrame(top.most_common(25))
temp = temp.iloc[1:,:]
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')


In [None]:
fig = px.bar(temp, x="count", y="Common_words", title='Common Words in Selected Text', orientation='h', 
             width=700, height=700,color='Common_words')
fig.show()

#  Most Common words in "Text"

In [None]:
top = Counter([item for sublist in train_df['text_clear'] for item in sublist])
temp = pd.DataFrame(top.most_common(25))
temp = temp.iloc[1:,:]
temp.columns = ['Common_words','count']
temp.style.background_gradient(cmap='Blues')

In [None]:
fig = px.bar(temp, x="count", y="Common_words", title='Common Words in Text', orientation='h', 
             width=700, height=700,color='Common_words')
fig.show()

# Most common words Sentiments Wise
Let's look at the most common words in different sentiments

The 25 most common words in ***positive*** tweets (counted on the cleaned text)

In [None]:
Positive_sent = train_df[(train_df['sentiment']== 1) ]

top = Counter([item for sublist in Positive_sent['text_clear'] for item in sublist])
temp_positive = pd.DataFrame(top.most_common(25))
temp_positive.columns = ['Common_words','count']
temp_positive.style.background_gradient(cmap='Greens')

The 25 most common words in ***negative*** tweets (counted on the cleaned text)

In [None]:
Negative_sent = train_df[(train_df['sentiment']== 2) ]

top = Counter([item for sublist in Negative_sent['text_clear'] for item in sublist])
temp_negative = pd.DataFrame(top.most_common(25))
temp_negative = temp_negative.iloc[1:,:] #except 'im'
temp_negative.columns = ['Common_words','count']
temp_negative.style.background_gradient(cmap='Reds')

The 25 most common words in ***neutral*** tweets (counted on the cleaned text)

In [None]:
Neutral_sent = train_df[(train_df['sentiment']== 0) ]

top = Counter([item for sublist in Neutral_sent['text_clear'] for item in sublist])
temp_neutral = pd.DataFrame(top.most_common(25))
temp_neutral = temp_neutral.iloc[1:,:] #except 'im'
temp_neutral.columns = ['Common_words','count']
temp_neutral.style.background_gradient(cmap='Greys')

# Unique Words in each Segment
We will look at the unique words in each segment in the following order: positive, negative, neutral.

In [None]:
raw_text = [word for word_list in train_df['selected_text_clear'] for word in word_list]


In [None]:
def words_unique(sentiment,numwords,raw_words):
    '''
    Input:
        sentiment - numeric sentiment label (0 = neutral, 1 = positive, 2 = negative);
        numwords - how many words you want to see in the final result;
        raw_words - list of all words from train_df['selected_text_clear'].
    Output:
        dataframe with the words that occur only in the chosen sentiment and how many times they occur (in descending order based on their counts).
    '''
    # words that occur under any other sentiment
    allother = set()
    for item in train_df[(train_df.sentiment != sentiment)]['selected_text_clear']:
        allother.update(item)

    # words that never occur outside the chosen sentiment
    specificnonly = set(x for x in raw_words if x not in allother)

    mycounter = Counter()

    for item in train_df[(train_df.sentiment == sentiment)]['selected_text_clear']:
        for word in item:
            mycounter[word] += 1

    for word in list(mycounter):
        if word not in specificnonly:
            del mycounter[word]

    Unique_words = pd.DataFrame(mycounter.most_common(numwords), columns = ['words','count'])

    return Unique_words
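
The idea behind `words_unique` is a set difference followed by a count: collect every word that appears under the other sentiments, then count only the words of the chosen sentiment that are not in that set. The sketch below shows this on hypothetical toy data, without pandas. The names `tokens_by_label` and `unique_words_demo` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy data: token lists grouped by sentiment label
tokens_by_label = {
    1: [["love", "great"], ["great", "day"]],  # positive
    2: [["sad", "day"], ["hate", "sad"]],      # negative
}


def unique_words_demo(label, numwords):
    """Count words of `label` that never appear under any other label."""
    other = {w for lab, docs in tokens_by_label.items() if lab != label
             for doc in docs for w in doc}
    counts = Counter(w for doc in tokens_by_label[label] for w in doc
                     if w not in other)
    return counts.most_common(numwords)


top_positive = unique_words_demo(1, 10)
print(top_positive)  # [('great', 2), ('love', 1)]
```

"day" is dropped from the positive list because it also occurs in a negative tweet.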

**The top 10 unique words in Positive Tweets are:**

In [None]:
Unique_Positive= words_unique(1, 10, raw_text)
print("The top 10 unique words in Positive Tweets are:")
Unique_Positive.style.background_gradient(cmap='Greens')

**The top 10 unique words in Negative Tweets are:**

In [None]:
Unique_Negative= words_unique(2, 10, raw_text)
print("The top 10 unique words in Negative Tweets are:")
Unique_Negative.style.background_gradient(cmap='Reds')

**The top 10 unique words in Neutral Tweets are:**

In [None]:
Unique_Neutral= words_unique(0, 10, raw_text)
print("The top 10 unique words in Neutral Tweets are:")
Unique_Neutral.style.background_gradient(cmap='Greys')


Up to now, we have analyzed the dataset.

From now on we will focus on modeling, using 7 different machine learning methods:

* Naive Bayes
* Logistic Regression
* Decision Tree
* Random Forest
* K-Nearest Neighbour
* Support Vector Machine
* LightGBM

These methods were chosen taking into account the opinions and tips of leading professionals.

To look for better results, we filter the training data by Jaccard score and fit all 7 methods on two subsets: tweets with a Jaccard score over 0.2 and tweets with a Jaccard score over 0.8.

# Modeling with jaccard scores over 0.2

In [None]:
train_df2 = train_df[train_df['jaccard_score'] > 0.2]
train_df2.head()

In [None]:
temp = train_df2.groupby('sentiment').count()['text'].reset_index().sort_values(by='text',ascending=False)
temp.style.background_gradient(cmap='Reds')

Bag of Words

A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms. The approach is very simple and flexible. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity.

In [None]:
selected_text_listt = []
for i in train_df2['selected_text_clear']:
    i = ' '.join(i)
    selected_text_listt.append(i)
    
from sklearn.feature_extraction.text import CountVectorizer 
max_features =500

count_vectorizer = CountVectorizer(max_features=max_features,stop_words = "english")

sparce_matrix = count_vectorizer.fit_transform(selected_text_listt).toarray()  


Scikit-learn’s CountVectorizer converts a collection of text documents into vectors of term/token counts. It also enables pre-processing of the text data before the vector representation is generated, which makes it a highly flexible feature representation module for text.
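
To see what a bag-of-words representation looks like, here is a minimal pure-Python sketch on two made-up documents. It is not how CountVectorizer is implemented. It only mirrors the idea of one count per vocabulary word, with word order discarded.

```python
from collections import Counter

docs = ["good morning good day", "bad day"]

# Vocabulary: every distinct word, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, aligned with the vocabulary
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocab])

print(vocab)    # ['bad', 'day', 'good', 'morning']
print(vectors)  # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

In the notebook, `max_features=500` additionally limits the vocabulary to the 500 most frequent terms, and `stop_words="english"` drops common English words before counting.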

In [None]:
y = train_df2['sentiment'].values     # sentiment labels as a 1-D array
x = sparce_matrix
# train test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 42)

In [None]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation.
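
The standardization formula is z = (x - mean) / std. The sketch below applies it by hand to one made-up feature column, using the population standard deviation (dividing by n), which is what StandardScaler uses. The values are illustrative only.

```python
import statistics

values = [1.0, 2.0, 3.0, 4.0, 5.0]

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation (divide by n)

scaled = [(v - mean) / std for v in values]
print([round(s, 3) for s in scaled])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

In the notebook the mean and standard deviation are learned from `x_train` only (`fit_transform`) and then reused on `x_test` (`transform`), so no information from the test set leaks into the scaling.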

There are three important terms that need to be known well in order to understand machine learning models. These terms are:

* Classification report
* Accuracy score
* Confusion matrix

A general overview of these terms can be found below.

A classification report is used to measure the quality of predictions from a classification algorithm: how many predictions are true and how many are false. More specifically, true positives, false positives, true negatives and false negatives are used to compute the metrics of a classification report.

The report shows the main classification metrics precision, recall and f1-score on a per-class basis.

The metrics are calculated by using true and false positives, true and false negatives. Positive and negative in this case are generic names for the predicted classes. There are four ways to check if the predictions are right or wrong:

    TN / True Negative: when a case was negative and predicted negative
    TP / True Positive: when a case was positive and predicted positive
    FN / False Negative: when a case was positive but predicted negative
    FP / False Positive: when a case was negative but predicted positive

**Precision** – What percent of your positive predictions were correct?

Precision is the ability of a classifier not to label an instance positive that is actually negative. For each class it is defined as the ratio of true positives to the sum of true and false positives.

TP – True Positives
FP – False Positives

Precision – Accuracy of positive predictions.
Precision = TP/(TP + FP)


**Recall** – What percent of the positive cases did you catch? 

Recall is the ability of a classifier to find all positive instances. For each class it is defined as the ratio of true positives to the sum of true positives and false negatives.

FN – False Negatives

Recall: Fraction of positives that were correctly identified.
Recall = TP/(TP+FN)


**F1 score** – How well are precision and recall balanced?

The F1 score is the harmonic mean of precision and recall, such that the best score is 1.0 and the worst is 0.0. Generally speaking, F1 scores are lower than accuracy measures as they embed precision and recall into their computation. As a rule of thumb, the weighted average of F1 should be used to compare classifier models, not global accuracy.

F1 Score = 2*(Recall * Precision) / (Recall + Precision)
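
The three formulas can be checked with a small worked example. The helper `precision_recall_f1` and the counts are hypothetical. The zero-denominator guards are an assumption added so the sketch never divides by zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for one class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one class
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Because F1 is a harmonic mean, it sits closer to the lower of the two values, so a model cannot score well by maximizing only precision or only recall.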

Accuracy score: the fraction of samples whose predicted label matches the true label. (In multilabel classification, scikit-learn's `accuracy_score` computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.)

A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix). In scikit-learn's `confusion_matrix`, each row represents the instances in an actual class while each column represents the instances in a predicted class (some sources use the opposite convention). The name stems from the fact that it makes it easy to see if the system is confusing two classes (i.e. commonly mislabeling one as another).
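
A confusion matrix and the accuracy score can both be built by hand from two label lists. The sketch below uses made-up labels and a hypothetical helper, `confusion_matrix_demo`, that follows the rows-are-true, columns-are-predicted layout used by scikit-learn's `confusion_matrix`.

```python
def confusion_matrix_demo(y_true, y_pred, n_classes):
    """Rows index the true class, columns index the predicted class."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


# Hypothetical labels: 0 = neutral, 1 = positive, 2 = negative
y_true = [0, 1, 2, 1, 0, 2]
y_pred = [0, 2, 2, 1, 0, 1]

matrix = confusion_matrix_demo(y_true, y_pred, n_classes=3)
accuracy = sum(matrix[i][i] for i in range(3)) / len(y_true)

print(matrix)    # [[2, 0, 0], [0, 1, 1], [0, 1, 1]]
print(accuracy)  # 4 correct predictions out of 6
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the number of samples. The off-diagonal cells show which classes are mistaken for each other.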

# Naive Bayes

> Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called Naive Bayes or idiot Bayes because the calculations of the probabilities for each class are simplified to make their calculations tractable.
> 
> Rather than attempting to calculate the probabilities of each attribute value, they are assumed to be conditionally independent given the class value.
> 
> This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well on data where this assumption does not hold.

In [None]:
# %% naive bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train,y_train)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Predicting the Test set results
y_pred = nb.predict(x_test)


print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

nb_02_accuracy = accuracy_score(y_test, y_pred)

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

In [None]:
# from sklearn.naive_bayes import MultinomialNB
# nb = MultinomialNB()
# nb.fit(x_train,y_train)

# from sklearn.metrics import *
# # Predicting the Test set results
# y_pred = nb.predict(x_test)


# print(classification_report(y_test, y_pred))
# print(confusion_matrix(y_test, y_pred))
# print(accuracy_score(y_test, y_pred))

# Logistic Regression

> Logistic regression is a classification algorithm used to assign observations to a discrete set of classes. Unlike linear regression which outputs continuous number values, logistic regression transforms its output using the logistic sigmoid function to return a probability value which can then be mapped to two or more discrete classes.

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 42)
classifier.fit(x_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(x_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

lr_02_accuracy = accuracy_score(y_test, y_pred)
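
The logistic sigmoid mentioned in the description maps any real-valued score to a probability between 0 and 1. A minimal sketch of that single step, for the binary case only (with three sentiment classes, scikit-learn generalizes this idea):

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


for score in (-2.0, 0.0, 2.0):
    print(score, round(sigmoid(score), 3))  # 0.119, 0.5, 0.881
```

A score of 0 corresponds to a probability of 0.5, which is the usual decision boundary between the two classes.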

Confusion Matrix 

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

# Decision Tree

> Decision tree is the most powerful and popular tool for classification and prediction. A Decision tree is a flowchart like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

In [None]:
# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(x_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

dt_02_accuracy = accuracy_score(y_test, y_pred)
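
The tree above is fitted with `criterion = 'entropy'`, which scores candidate splits by how mixed the class labels are in each resulting node. The sketch below computes Shannon entropy for two made-up label lists. The helper name `entropy` is an assumption for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


mixed = [0, 0, 1, 1]  # evenly split between two classes
pure = [1, 1, 1, 1]   # a single class

print(entropy(mixed))  # 1.0
```

A node with a single class has zero entropy, so the tree prefers splits that move the data toward such pure nodes.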

Confusion Matrix

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

# Random Forest

> Random forests is a supervised learning algorithm. It can be used both for classification and regression. It is also the most flexible and easy to use algorithm. A forest is comprised of trees. It is said that the more trees it has, the more robust a forest is. Random forests creates decision trees on randomly selected data samples, gets prediction from each tree and selects the best solution by means of voting. It also provides a pretty good indicator of the feature importance.

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100, criterion = 'entropy', random_state = 0)
classifier.fit(x_train, y_train)

y_pred = classifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

rf_02_accuracy = accuracy_score(y_test, y_pred)
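
The voting step described for random forests can be sketched as a majority vote over per-tree predictions. The tree outputs below are made up. Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so this is an illustration of the idea, not of the library's internals.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical predictions from five trees for one tweet
tree_predictions = [1, 0, 1, 2, 1]
print(majority_vote(tree_predictions))  # 1
```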

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

# K-Nearest Neighbour

> K Nearest Neighbor(KNN) is a very simple, easy to understand, versatile and one of the topmost machine learning algorithms. KNN used in the variety of applications such as finance, healthcare, political science, handwriting detection, image recognition and video recognition. In Credit ratings, financial institutes will predict the credit rating of customers. In loan disbursement, banking institutes will predict whether the loan is safe or risky. In political science, classifying potential voters in two classes will vote or won’t vote. KNN algorithm used for both classification and regression problems. 

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
classifier.fit(x_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

knn_02_accuracy = accuracy_score(y_test, y_pred)
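
With `n_neighbors = 3` and the Minkowski metric at `p = 2` (Euclidean distance), a prediction is the majority label among the three closest training points. A minimal pure-Python sketch on made-up 2-D points, where `knn_predict` is a hypothetical helper:

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the majority label among the k closest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical 2-D feature vectors with sentiment labels
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
labels = [0, 0, 0, 1, 1]
print(knn_predict(points, labels, query=(4, 5)))  # 1
```

The two closest points to the query both carry label 1, so they outvote the third neighbour. In the notebook each point is a 500-dimensional scaled count vector rather than a 2-D coordinate.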

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

# Support Vector Machine

> SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
In addition to performing linear classification, SVMs can efficiently perform a non-linear classification, implicitly mapping their inputs into high-dimensional feature spaces.

In [None]:
# LOAD LIBRARIES
from sklearn.svm import SVC
clf = SVC(probability=True,kernel='poly',degree=4,gamma='auto')
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

svm_02_accuracy = accuracy_score(y_test, y_pred)

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

# LightGBM

> LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed and efficient with the following advantages:
> 
> * Faster training speed and higher efficiency.
>  
> * Lower memory usage.
>  
> * Better accuracy.
>  
> * Support of parallel and GPU learning.
>  
> * Capable of handling large-scale data.

In [None]:
from lightgbm import LGBMClassifier
lgbm_model = LGBMClassifier().fit(x_train, y_train)

# Predicting the Test set results
y_pred = lgbm_model.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

lgbm_02_accuracy = accuracy_score(y_test, y_pred)

# Modeling with jaccard scores over 0.8

In [None]:
train_df8 = train_df[train_df['jaccard_score'] > 0.8]

Bag of Words

In [None]:
selected_text_listt = []
for i in train_df8['selected_text_clear']:
    i = ' '.join(i)
    selected_text_listt.append(i)
    
from sklearn.feature_extraction.text import CountVectorizer 
max_features = 500

count_vectorizer = CountVectorizer(max_features=max_features,stop_words = "english")

sparce_matrix = count_vectorizer.fit_transform(selected_text_listt).toarray()  


In [None]:
y = train_df8['sentiment'].values     # sentiment labels as a 1-D array
x = sparce_matrix
# train test split
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size = 0.2, random_state = 42)


# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# Naive Bayes

In [None]:
# %% naive bayes
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train,y_train)

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
# Predicting the Test set results
y_pred = nb.predict(x_test)


print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

nb_08_accuracy = accuracy_score(y_test, y_pred)

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

# Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 42)
classifier.fit(x_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

lr_08_accuracy = accuracy_score(y_test, y_pred)

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

# Decision Tree

In [None]:
# Fitting Decision Tree Classification to the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(x_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

dt_08_accuracy = accuracy_score(y_test, y_pred)

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

# Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=100, criterion = 'entropy', random_state = 0)
classifier.fit(x_train, y_train)

y_pred = classifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

rf_08_accuracy = accuracy_score(y_test, y_pred)

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

# K-Nearest Neighbour

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 3, metric = 'minkowski', p = 2)
classifier.fit(x_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

knn_08_accuracy = accuracy_score(y_test, y_pred)

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

# Support Vector Machine

In [None]:
# LOAD LIBRARIES
from sklearn.svm import SVC
clf = SVC(probability=True,kernel='poly',degree=4,gamma='auto')
clf.fit(x_train, y_train)

y_pred = clf.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

svm_08_accuracy = accuracy_score(y_test, y_pred)

In [None]:
from sklearn.metrics import confusion_matrix
mat = confusion_matrix(y_test, y_pred)
sns.heatmap(mat, square=True, annot=True, fmt='d', cbar=False,
            xticklabels='', yticklabels='')
plt.xlabel('predicted label')
plt.ylabel('true label');

# LightGBM 

In [None]:
from lightgbm import LGBMClassifier
lgbm_model = LGBMClassifier().fit(x_train, y_train)

# Predicting the Test set results
y_pred = lgbm_model.predict(x_test)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

lgbm_08_accuracy = accuracy_score(y_test, y_pred)

In [None]:
df_accuracies = [lr_02_accuracy,nb_02_accuracy,dt_02_accuracy,rf_02_accuracy,knn_02_accuracy,svm_02_accuracy,lgbm_02_accuracy,lr_08_accuracy,nb_08_accuracy,dt_08_accuracy,rf_08_accuracy,knn_08_accuracy,svm_08_accuracy,lgbm_08_accuracy]


In [None]:
df_accuracies = pd.DataFrame(data = df_accuracies, index=range(len(df_accuracies)),columns=['accuracy'])
df_accuracies['model_name'] = ['logistic regression 02','naive bayes 02','decision tree 02','random forest 02','knn 02','svm 02','lightgbm 02','logistic regression 08','naive bayes 08','decision tree 08','random forest 08','knn 08','svm 08','lightgbm 08']

In [None]:
df_accuracies.head(14)  # show all 14 model/threshold combinations


In [None]:
df_accuracies.plot(kind='bar',x='model_name',y='accuracy',figsize=(15,10))
