<table align="center">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/labrijisaad/Twitter-Sentiment-Analysis-with-Python/blob/main/Twitter%20Sentiment%20Analysis%20with%20Python.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
</table>

<table align="center">
  <td>
    <a target="_blank" href="https://drive.google.com/file/d/19IeqXU96-kDt6wy1wTNyhWrIw1jbK2Kx/view?usp=sharing"><img/>Download the dataset</a>
  </td>
</table>

Notebook created by [@labriji_saad](https://github.com/labrijisaad) with the help of [analyticsvidhya](https://www.analyticsvidhya.com/).

## 🔰 `Introduction` :

>**Sentiment analysis** refers to `identifying as well as classifying the sentiments` that are expressed in the text source. Tweets are often useful in generating a vast amount of sentiment data upon analysis. These data are useful in understanding the opinion of the people about a variety of topics.

>Therefore we need to develop an **Automated Machine Learning Sentiment Analysis Model** in order to `compute the customer perception`. Raw tweets contain non-useful characters (collectively termed noise) alongside the useful data, which makes it difficult to train models on them directly.

## 🎯  `Objective` :

>We aim to analyze the sentiment of the tweets provided from the `Sentiment140 dataset` by developing a **`machine learning pipeline`** involving the use of three classifiers:
> - **`Logistic Regression`**.
> - **`Bernoulli Naive Bayes`**.
> - **`SVM`**.  <br>                                                           
Along with using **`Term Frequency- Inverse Document Frequency (TF-IDF)`**. 

>The **performance** of these classifiers is then **evaluated** using **accuracy**, **ROC-AUC Curve** and **F1 Scores**.

## ❓ `Some explanations` :
> What is **`Logistic Regression ?`** <br>
> - **Logistic regression** is a `statistical analysis method` to predict a binary outcome, such as yes or no, based on prior observations of a data set. It is used in statistical software to `understand the relationship between the dependent variable and one or more independent variables` by estimating probabilities using a logistic regression equation. 
> - **Logistic regression** is `used when your Y variable can take only two values`. It works best when the two classes are (approximately) linearly separable in the feature space.

> What is **`Bernoulli Naive Bayes ?`** <br>
> - **Bernoulli Naive Bayes** implements the naive Bayes training and classification algorithms for data that is distributed according to multivariate Bernoulli distributions.
> - **Bernoulli Naive Bayes** is one of the variants of the Naive Bayes algorithm in machine learning. It is well suited to binary features, where each feature records whether a term is present or absent in the document.

> What is **`SVM ?`** <br>
> - **Support Vector Machine** is a supervised machine learning algorithm used for both classification and regression, although it is best suited to classification. The objective of the SVM algorithm is to find a hyperplane in an N-dimensional space that distinctly separates the data points of the two classes.
> - **Support vector machines** are a set of supervised learning methods used for classification, regression and outlier detection. Their main advantages are that they are effective in high-dimensional spaces, and remain effective when the number of dimensions is greater than the number of samples.

> What is **`Term Frequency- Inverse Document Frequency (TF-IDF) ?`** <br>
> - **Term Frequency** gives us information on how often a term appears in a document.
> - **Inverse Document Frequency** is a measure of whether a term is common or rare in a given document corpus.
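
The two quantities above can be combined by hand on a toy corpus. The sketch below is only illustrative: the three example sentences are hypothetical, and it uses the textbook weighting `tf * ln(N / df)`. Scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# Toy corpus (hypothetical tweets, already lowercased).
corpus = [
    "good morning good day",
    "bad day",
    "good night",
]
docs = [text.split() for text in corpus]
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Return {term: tf * idf} with tf = count / len(doc) and idf = ln(N / df)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tf_idf(doc) for doc in docs]
for text, w in zip(corpus, weights):
    print(text, {term: round(value, 3) for term, value in w.items()})
```

In the first document, `morning` receives a higher weight than `day` even though both appear once, because `morning` occurs in fewer documents of the corpus.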

> What is **`Stemming and lemmatization ?`** <br>
> - The goal of both  **`stemming`** and  **`lemmatization`** is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. 
>- For instance:
>  - **`am`**, **`are`**, **`is`** $\Rightarrow$ **`be`**
>  - **`car`**, **`cars`**, **`car's`**, **`cars'`** $\Rightarrow$ **`car`**
The result of this mapping of text will be something like:
>  - `the boy's cars are different colors` $\Rightarrow$
`the boy car be differ color`
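
To make the difference between the two techniques concrete, here is a deliberately tiny sketch. Both the suffix rules and the lemma dictionary are hypothetical toys, far simpler than the Porter stemmer or the WordNet lemmatizer: stemming strips suffixes with heuristics, while lemmatization looks words up in a vocabulary.

```python
# Toy suffix-stripping stemmer (hypothetical rules, not the Porter algorithm).
SUFFIXES = ("ing", "ent", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary (hypothetical; a real lemmatizer uses a full vocabulary).
LEMMAS = {"am": "be", "are": "be", "is": "be", "cars": "car", "colors": "color"}

def toy_lemmatize(word):
    """Return the dictionary form of a word, or the word itself if unknown."""
    return LEMMAS.get(word, word)

words = ["cars", "are", "different", "colors"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

The stemmer turns `different` into the non-word `differ` but leaves `are` untouched, whereas the dictionary lookup maps `are` to `be` and keeps `different` intact.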



> How do we **`evaluate our model ?`** <br>
> - After training the model we then apply the evaluation measures to check how the model is performing. Accordingly, we use the following evaluation parameters to check the performance of the models respectively :
> - **`Accuracy Score`** : The proportion of tweets that the model classifies correctly. It is a meaningful summary here because the dataset is balanced between positive and negative tweets.
> - **`Execution time`** : Variable that depends on the machine on which the program was executed, but which can give a small idea of the model that executes the fastest.
> - **`F1 score`** : The F1 score is the harmonic mean of precision and recall, such that `the best score is 1.0` and `the worst is 0.0`. Because it combines precision and recall, **a model only gets a high F1 score when both are high**.



> - **`ROC-AUC Curve`** : The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
> - **`Confusion Matrix with Plot`** : A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. Note that :
    * **`Actual values`** are the rows.
    * **`Predicted values`** are the columns.
><table>
    <tbody>
        <tr>
            <td></td>
            <td><b>Predicted Positive</b></td>
            <td><b>Predicted Negative</b></td>
        </tr>
        <tr>
            <td><b>Actual Positive</b></td>
            <td>TP</td>
            <td>FN</td>
        </tr>
        <tr>
            <td><b>Actual Negative</b></td>
            <td>FP</td>
            <td>TN</td>
        </tr>
    </tbody>
</table>
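
The metrics listed above can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch with hypothetical counts; the function name `classification_metrics` is an assumption made for illustration and is not part of scikit-learn.

```python
def classification_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for 100 test tweets.
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
print(metrics)
```

Note that the F1 score never uses the true negatives, which is why it can differ from accuracy in either direction.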


> What is **`word_tokenize() ?`** <br>
> - **Tokenization** is the act of breaking up a sequence of strings into pieces such as words, keywords, phrases, symbols and other elements called tokens.
> - **`word_tokenize()`** is the NLTK function that splits a text into word and punctuation tokens. **Return** : the list of tokens of the text.
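
As a rough illustration of what tokenization produces, the sketch below uses a regular expression as a stand-in. It is an assumption made to keep the example dependency-free: NLTK's `word_tokenize()` is more sophisticated (for instance in how it treats contractions), so its output can differ.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("I love this city, it is so convenient!")
print(tokens)
```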

## 📚 `Project Pipeline` :
>The various steps involved in the Machine Learning Pipeline are :
> - **1️⃣ `Import Necessary Dependencies`**.

> - **2️⃣ `Read and Load the Dataset`**.

> - **3️⃣ `Exploratory Data Analysis`**.

> - **4️⃣ `Data Visualization of Target Variables`**.

> - **5️⃣ `Data Preprocessing`**.

> - **6️⃣ `Splitting our data into Train and Test Subset`**.

> - **7️⃣ `Transforming Dataset using TF-IDF Vectorizer`**.

> - **8️⃣ `Function for Model Evaluation`**.

> - **9️⃣ `Model Building`**.

> - **1️⃣0️⃣ `Conclusion`**.

### 1️⃣ `Importing the necessary dependencies` :

> Here in this part, we import all the necessary libraries that we will use in our project. The choice of libraries depends on the approach we will follow.

In [None]:
# utilities : 
import re # regular expression library
import numpy as np
import pandas as pd

# plotting :
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# nltk :
import nltk
from nltk.stem import WordNetLemmatizer

# sklearn :
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report

# time library :
import time

from google.colab import files
from os import environ

In [None]:
nltk.__version__

In [None]:
!pip install -q kaggle

In [None]:
uploaded = files.upload()

In [None]:
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json

In [None]:
! kaggle datasets download -d kazanova/sentiment140

In [None]:
! unzip /content/sentiment140.zip 

### 2️⃣ `Reading and Loading the Dataset` :

> In any project related to the manipulation and analysis of data, we always start by collecting the data on which we are going to work. In our case, we will import our data from a `.csv` file.

The various columns present in the dataset are:
- `target`: the polarity of the tweet (positive or negative)
- `ids`: Unique id of the tweet
- `date`: the date of the tweet
- `flag`: It refers to the query. If no such query exists then it is NO QUERY.
- `user`: It refers to the name of the user that tweeted
- `text`: It refers to the text of the tweet

In [None]:
# Importing the dataset :
DATASET_COLUMNS=['target','ids','date','flag','user','text']
DATASET_ENCODING = "ISO-8859-1"
df = pd.read_csv('/content/training.1600000.processed.noemoticon.csv', encoding=DATASET_ENCODING, names=DATASET_COLUMNS)

# Display 5 random rows :
df.sample(5)

### 3️⃣ `Exploratory Data Analysis` :
> In this part, the objective is to get to know the imported data as well as possible: we look at a sample, the shape of the dataset, the column names and the data types, and we check for null values. Above all, we identify the columns that interest us.

In [None]:
# Display of the first 5 lines of our dataset :
df.head()

In [None]:
# Display the column names of our dataset :
df.columns

In [None]:
# Display the number of records in our dataset :
print('length of data is', len(df))

In [None]:
df.dtypes

- The data type of some columns in our dataset is `object` (text), which means we still have to convert the text into numeric features before training machine learning models.

In [None]:
# Checking for Null values :
np.sum(df.isnull().any(axis=1))

- Good news: there are no missing values in our dataset.

In [None]:
# Rows and columns in the dataset :
print('Count of columns in the data is:  ', len(df.columns))
print('Count of rows in the data is:  ', len(df))

In [None]:
# Checking unique Target Values :
df['target'].unique()

In [None]:
df['target'].nunique()

The **`target`** column is composed of just **0** and **4**
 - **0** stands for `negative` sentiment.
 - **4** stands for `positive` sentiment.

### 4️⃣  `Data Visualization of Target Variables` :
> After exploring our data and identifying the columns we are interested in, the next step is to visualize it. Plots make the data easier to read, so the class distribution becomes understandable at a glance.

In [None]:
df.groupby('target').count()

- Since the **`target`** column only contains **0** or **4**, using the **`.groupby()`** function will result in two categories: **0** and **4**

In [None]:
# Plotting the distribution for dataset :
ax = df.groupby('target').count().plot(kind='bar', title='Distribution of data',legend=False)
# Naming 0 -> Negative , and 4 -> Positive
ax.set_xticklabels(['Negative','Positive'], rotation=0)

# Storing data in lists :
text, sentiment = list(df['text']), list(df['target'])

- Each color represents one of the columns : **`ids`**, **`date`**, **`flag`**, **`user`**	and **`text`**.
- **`text`** variable contains the **`text`** column.
- **`sentiment`** variable contains the **`target`** column.

In [None]:
import seaborn as sns
sns.countplot(x='target', data=df)

- We did the same as before, we just used the **`.countplot()`** function from **`seaborn`**.

### 5️⃣  `Data Preprocessing ` :
> Before training the model, we will perform various pre-processing steps on the dataset such as: 
>- Removing stop words.
>- Removing emojis. 
>- Converting the text document to lowercase for better generalization.
>- Cleaning the punctuation (to reduce unnecessary noise from the dataset).
>- Removing the repeating characters from the words along with removing the URLs as they do not have any significant importance. <br>                          
and much more, we will see this in detail later...

> We will then perform :
>- **`Stemming`** : reducing words to their stems by stripping suffixes.
>- **`Lemmatization`** : reducing inflected words to their dictionary form, known as the lemma, for better results.

In [None]:
# Selecting the text and target columns for our further analysis
# (.copy() avoids pandas' SettingWithCopyWarning when we modify `data` below) :
data = df[['text','target']].copy()

- **`data`** variable contains the **`target`** and **`text`** columns.

In [None]:
# Replacing the values to ease understanding :
data['target'] = data['target'].replace(4,1)

In [None]:
# Print unique values of target variable :
data['target'].unique()

In [None]:
# Test example 
d_d = {'text': ['I want to hit you with a rock for what you hava just said', 
              'I really appreciate that you shared with me with your thoughts about a situation, so do not worry about that I would never hit you with anything for that genle thing you have made',
              'I like potato and meat',
              'I hate coca-cola',
              'I am sick of travelling alone',
              'Some teachers have maniac idea to give to students as many homework as possible. It is hard to deal with them',
              'Many friends of my girlfriend complain too much about everything, it is so annoying',
              'My friends are too strong and clever they always have a good mood and ready work hard',
              'I am happy to study in this city, it is so enormous and convenient',
              'I love quite places like the morning on a lake'
              ],
       'target': [0, 1, 1, 0,0,0, 0, 1, 1, 1]}
df_d = pd.DataFrame(data=d_d)
print(df_d)
df_d.to_csv('data_example.csv', index=False)

The **`target`** column is composed of just **0** and **1**
 - **0** stands for **`negative`** sentiment.
 - **1** stands for **`positive`** sentiment.

In [None]:
# Separating positive and negative tweets :
data_pos = data[data['target'] == 1]
data_neg = data[data['target'] == 0]

 - The **`data_pos`** variable contains the **`text`** and the **`target = 1`** columns. 
 - The **`data_neg`** variable contains the **`text`** and the **`target = 0`** columns. 

In [None]:
# Combining positive and negative tweets :
dataset = pd.concat([data_pos, data_neg])

- The **`dataset`** variable is a pandas dataframe **(1600000 rows x 2 columns)** that contains the **`text`** and the **`target`** columns. 
- The **800000** first rows are the positive tweets.
- The **800000** second rows are the negative tweets.

In [None]:
# Quick view of how our data looks:
dataset['text'].tail()

In [None]:
# Making statement text in lower case :
dataset['text'] = dataset['text'].str.lower()
dataset['text'].tail()

- **`text`** column is made up of only lowercase characters.

In [None]:
# Defining set containing all stopwords in English :
stopwordlist = ['a', 'about', 'above', 'after', 'again', 'ain', 'all', 'am', 'an',
                'and', 'any', 'are', 'as', 'at', 'be', 'because', 'been', 'before',
                'being', 'below', 'between', 'both', 'by', 'can', 'd', 'did', 'do',
                'does', 'doing', 'down', 'during', 'each', 'few', 'for', 'from',
                'further', 'had', 'has', 'have', 'having', 'he', 'her', 'here',
                'hers', 'herself', 'him', 'himself', 'his', 'how', 'i', 'if', 'in',
                'into', 'is', 'it', 'its', 'itself', 'just', 'll', 'm', 'ma',
                'me', 'more', 'most', 'my', 'myself', 'now', 'o', 'of', 'on', 'once',
                'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'own', 're',
                's', 'same', 'she', "shes", 'should', "shouldve", 'so', 'some', 'such',
                't', 'than', 'that', "thatll", 'the', 'their', 'theirs', 'them',
                'themselves', 'then', 'there', 'these', 'they', 'this', 'those',
                'through', 'to', 'too', 'under', 'until', 'up', 've', 'very', 'was',
                'we', 'were', 'what', 'when', 'where', 'which', 'while', 'who', 'whom',
                'why', 'will', 'with', 'won', 'y', 'you', "youd", "youll", "youre",
                "youve", 'your', 'yours', 'yourself', 'yourselves']


In [None]:
# Cleaning and removing the above stop words list from the tweet text :
STOPWORDS = set(stopwordlist)


def cleaning_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])


dataset['text'] = dataset['text'].apply(lambda text: cleaning_stopwords(text))
dataset['text'].head()

- **`text`** column has been cleaned of stop words.

In [None]:
#  Cleaning and removing punctuations :
import string

english_punctuations = string.punctuation
punctuations_list = english_punctuations


def cleaning_punctuations(text):
    translator = str.maketrans('', '', punctuations_list)
    return text.translate(translator)


dataset['text'] = dataset['text'].apply(lambda x: cleaning_punctuations(x))
dataset['text'].tail()

- **`text`** column has been cleaned of punctuation.

In [None]:
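# Cleaning and removing URLs :
# Note: this runs after punctuation removal above, so "://" and "." are already stripped;
# apply it before cleaning_punctuations for the pattern to match full URLs.
def cleaning_URLs(data):
    return re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', ' ', data)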


dataset['text'] = dataset['text'].apply(lambda x: cleaning_URLs(x))
dataset['text'].tail()

- **`text`** column has now been cleaned of URLs.
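
The order of the cleaning steps matters for URLs. The sketch below, which uses a hypothetical tweet and helper names chosen for illustration, compares removing URLs before and after stripping punctuation. It assumes a pattern based on `\S` (any non-whitespace character).

```python
import re
import string

URL_PATTERN = re.compile(r"(www\.\S+)|(https?://\S+)")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_urls(text):
    """Replace every URL-like token with a space."""
    return URL_PATTERN.sub(" ", text)

def remove_punctuation(text):
    """Delete all ASCII punctuation characters."""
    return text.translate(PUNCT_TABLE)

tweet = "great read http://example.com/post?id=1 thanks!"

urls_first = remove_punctuation(remove_urls(tweet))
punct_first = remove_urls(remove_punctuation(tweet))

print(urls_first.split())
print(punct_first.split())
```

When punctuation is stripped first, the URL collapses into a single meaningless token that the pattern can no longer recognize, so it stays in the text.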

In [None]:
# Cleaning and removing numbers :
def cleaning_numbers(data):
    return re.sub('[0-9]+', '', data)
dataset['text'] = dataset['text'].apply(lambda x: cleaning_numbers(x))
dataset['text'].tail()

- Column **`text`** has now been cleaned of numbers.

In [None]:
# Getting tokenization of tweet text :
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

dataset['text'] = dataset['text'].apply(word_tokenize)
dataset['text'].head()

- Column **`text`** has now been tokenized.

In [None]:
# Applying Stemming :
st = nltk.PorterStemmer()

def stemming_on_text(data):
    text = [st.stem(word) for word in data]
    return text

dataset['text'] = dataset['text'].apply(lambda x: stemming_on_text(x))
dataset['text'].head()

- **Stemming** has now been applied to the **`text`** column.

In [None]:
# Applying Lemmatizer :
lm = nltk.WordNetLemmatizer()
nltk.download('wordnet')
nltk.download('omw-1.4')
def lemmatizer_on_text(data):
    text = [lm.lemmatize(word) for word in data]
    return text

dataset['text'] = dataset['text'].apply(lambda x: lemmatizer_on_text(x))
dataset['text'].head()

- **Lemmatizer** has now been applied to the **`text`** column.

In [None]:
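# Separating input feature and label :
# Note: X is taken from `data` (the raw tweets with relabelled targets), not from the
# preprocessed `dataset`; TfidfVectorizer lowercases and tokenizes this text itself.
X = data.text
y = data.target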

In [None]:
# Plot a cloud of words for negative tweets :
data_neg = data['text'][:800000] # selecting the negative tweets.
plt.figure(figsize=(20, 20))
wc = WordCloud(max_words=1000, width=1600, height=800,
               collocations=False).generate(" ".join(data_neg))
plt.imshow(wc)

- As the picture shows, a lot of negative words appear: bad, sad, wrong..

In [None]:
# Plot a cloud of words for positive tweets :
data_pos = data['text'][800000:]  # selecting the positive tweets.
wc = WordCloud(max_words=1000, width=1600, height=800,
               collocations=False).generate(" ".join(data_pos))
plt.figure(figsize=(20, 20))
plt.imshow(wc)

- As the picture shows, a lot of positive words appear: good, love, happy..

### 6️⃣  `Splitting our data into Train and Test Subset ` :

In [None]:
# Separating the 95% data for training data and 5% for testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=26105111)

- **`random_state`** makes the split reproducible: with a fixed value, **train_test_split** returns the same train and test subsets on every run. Without it, each run may produce a different split, which makes results harder to compare and issues harder to debug.

- **`X`** contains **`data.text`**
- **`y`** contains **`data.target`**


- **`X_train`** contains **95%** of **`data.text`**
- **`X_test`** contains **5%** of **`data.text`**


- **`y_train`** contains **95%** of **`data.target`**
- **`y_test`** contains **5%** of **`data.target`**
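
The effect of a fixed seed can be reproduced with the standard library alone. This is only a sketch of the idea, not scikit-learn's implementation: `seeded_split` is a hypothetical helper, and the 100 integer samples stand in for the tweets.

```python
import random

def seeded_split(items, test_size, seed):
    """Shuffle a copy of items with a fixed seed, then cut off the test fraction."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(100))
train_a, test_a = seeded_split(samples, test_size=0.05, seed=26105111)
train_b, test_b = seeded_split(samples, test_size=0.05, seed=26105111)

# Same seed, same split: 95 training items, 5 test items, identical both times.
print(len(train_a), len(test_a), test_a == test_b)
```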

###  7️⃣ `Transforming Dataset using TF-IDF Vectorizer` :
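> Scikit-learn's **`TfidfTransformer`** and **`TfidfVectorizer`** both produce a matrix of **TF-IDF features**: `TfidfVectorizer` works directly on a collection of raw documents, while `TfidfTransformer` expects a matrix of term counts (for example the output of `CountVectorizer`).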

In [None]:
# Fit the TF-IDF Vectorizer :
vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser.fit(X_train)
print('No. of feature_words: ', len(vectoriser.get_feature_names_out() ))

In [None]:
# Transform the data using TF-IDF Vectorizer :
X_train = vectoriser.transform(X_train)
X_test  = vectoriser.transform(X_test)
X_test

In [None]:
X

In [None]:
print(X_train)

In [None]:
y_train

###  8️⃣ `Function for Model Evaluation` :
> After training the model we then apply the evaluation measures to check how the model is performing. Accordingly, we use the following evaluation parameters to check the performance of the models respectively :
> - **`Accuracy Score`** : The proportion of tweets that the model classifies correctly. It is a meaningful summary here because the dataset is balanced between positive and negative tweets.
> - **`ROC-AUC Curve`** : The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish between classes and is used as a summary of the ROC curve. The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.
> - **`Confusion Matrix with Plot`** : A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model.
    * **`Actual values`** are the rows.
    * **`Predicted values`** are the columns.
><table>
    <tbody>
        <tr>
            <td></td>
            <td><b>Predicted Positive</b></td>
            <td><b>Predicted Negative</b></td>
        </tr>
        <tr>
            <td><b>Actual Positive</b></td>
            <td>TP</td>
            <td>FN</td>
        </tr>
        <tr>
            <td><b>Actual Negative</b></td>
            <td>FP</td>
            <td>TN</td>
        </tr>
    </tbody>
</table> 


In [None]:
def model_Evaluate(model):
    # Predict values for Test dataset
    y_pred = model.predict(X_test)
    # Print the evaluation metrics for the dataset.
    print(classification_report(y_test, y_pred))
    # Compute and plot the Confusion matrix
    cf_matrix = confusion_matrix(y_test, y_pred)
    categories = ['Negative','Positive']
    group_names = ['True Neg','False Pos', 'False Neg','True Pos']
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    sns.heatmap(cf_matrix, annot = labels, cmap = 'Blues',fmt = '',
    xticklabels = categories, yticklabels = categories)
    plt.xlabel("Predicted values", fontdict = {'size':14}, labelpad = 10)
    plt.ylabel("Actual values" , fontdict = {'size':14}, labelpad = 10)
    plt.title ("Confusion Matrix", fontdict = {'size':18}, pad = 20)

In [None]:
pattern = r'\.?\w+'
s = 'data_example.csv'
find_sp = re.findall(pattern, s)
name, suffix = find_sp[0], find_sp[1]

- To avoid redrawing the confusion matrix and reprinting the precision, recall and f1-score for each model, we define the **`model_Evaluate()`** function once and call it for every model.

###  9️⃣ `Model Building` :
> In the problem statement we have used three different models respectively :
>- **`Bernoulli Naive Bayes`**.
>- **`SVM (Support Vector Machine)`**.
>- **`Logistic Regression`**.

>The idea behind choosing these models is that **we want to try all the classifiers on the dataset** ranging from simple models to complex models, and try to **find the one that performs the best**.

In [None]:
# Model-1 : Bernoulli Naive Bayes.
BNBmodel = BernoulliNB()
start = time.time()
BNBmodel.fit(X_train, y_train)
end = time.time()
print("The execution time of this model is {:.2f} seconds\n".format(end-start))
model_Evaluate(BNBmodel)
y_pred1 = BNBmodel.predict(X_test)

- The **`class 0`** is the class of **negative tweets**.
- The **`class 1`** is the class of **positive tweets**.

In [None]:
# Plot the ROC-AUC Curve for model-1 :
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred1)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")
plt.show()

In [None]:
# Model-2 : SVM (Support Vector Machine).
SVCmodel = LinearSVC()
start = time.time()
SVCmodel.fit(X_train, y_train)
end = time.time()
print("The execution time of this model is {:.2f} seconds\n".format(end-start))
model_Evaluate(SVCmodel)
y_pred2 = SVCmodel.predict(X_test)

- The **`class 0`** is the class of **negative tweets**.
- The **`class 1`** is the class of **positive tweets**.

In [None]:
# Plot the ROC-AUC Curve for model-2 :
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred2)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")
plt.show()

In [None]:
# Model-3 : Logistic Regression.
LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
start = time.time()
LRmodel.fit(X_train, y_train)
end = time.time()
print("The execution time of this model is {:.2f} seconds\n".format(end-start))
model_Evaluate(LRmodel)
y_pred3 = LRmodel.predict(X_test)

- The **`class 0`** is the class of **negative tweets**.
- The **`class 1`** is the class of **positive tweets**.

In [None]:
# Plot the ROC-AUC Curve for model-3 :
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_pred3)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=1, label='ROC curve (area = %0.2f)' % roc_auc)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC CURVE')
plt.legend(loc="lower right")
plt.show()
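
The ROC cells above pass hard 0/1 predictions from `predict()` to `roc_curve`, which yields a single operating point rather than a full curve. In scikit-learn, continuous scores are available through `predict_proba` or `decision_function`, depending on the model. The sketch below is a dependency-free illustration with hypothetical scores: `roc_auc` is a helper written for this example that computes the AUC as the fraction of correctly ranked (positive, negative) pairs.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 0, 1, 1, 1]
scores = [0.1, 0.4, 0.6, 0.5, 0.7, 0.9]          # hypothetical model scores
labels = [1 if s >= 0.5 else 0 for s in scores]  # hard predictions at threshold 0.5

auc_from_scores = roc_auc(y_true, scores)
auc_from_labels = roc_auc(y_true, labels)
print(round(auc_from_scores, 3), round(auc_from_labels, 3))
```

With hard labels the AUC reduces to the average of the true positive rate and the true negative rate at one threshold, so it can differ from the value obtained from the underlying scores.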

###  1️⃣0️⃣ `Conclusion` :
> After evaluating all models, we can conclude the following details :
><table>
    <thead>
        <tr>
            <th><b>Model</b></th>
            <th><b>Accuracy</b></th>
            <th><b>F1-score ( class 0 )</b></th>
            <th><b>F1-score ( class 1 )</b></th>
            <th><b>AUC Score</b></th>
            <th><b>Execution time</b></th>
        </tr>
        </thead>
    <tbody>
        <tr>
            <td><b>Bernoulli Naive Bayes (BNB)</b></td>
            <td>80%</td> 
            <td>80%</td>
            <td>80%</td>
            <td>80%</td>
            <td>0.69 seconds</td>
        </tr>
        <tr>
            <td><b>Support Vector Machine (SVM)</b></td>
            <td>82%</td> 
            <td>81%</td>
            <td>82%</td>
            <td>82%</td>
            <td>28.32 seconds</td>
        </tr>
        <tr>
            <td><b>Logistic Regression (LR)</b></td>
            <td>83%</td> 
            <td>83%</td>
            <td>83%</td>
            <td>83%</td>
            <td>163.56 seconds</td>
        </tr>
    </tbody>
</table>


 - **`Execution time`** : When it comes to comparing the running time of models, `Bernoulli Naive Bayes` performs faster than `SVM`, which in turn runs faster than `Logistic Regression`.
 - **`Accuracy`** : When it comes to model accuracy, `logistic regression` performs better than `SVM`, which in turn performs better than `Bernoulli Naive Bayes`.
 - **`F1-score`** : The F1 scores for **class 0** and **class 1** are :
> - For **class 0** (negative tweets) : 
>```
F1-score : BNB (= 0.80) < SVM (= 0.81) < LR (= 0.83) 
>``` 
> - For **class 1** (positive tweets) : 
>```
F1-score : BNB (= 0.80) < SVM (= 0.82) < LR (= 0.83) 
>```
 - **`AUC Score`** : The ROC-AUC scores follow the same ranking as accuracy.
>```
AUC score : BNB (= 0.80) < SVM (= 0.82) < LR (= 0.83) 
>``` 

- We therefore conclude that **`logistic regression`** is the **best model** for the above dataset (although it took much longer to run than the other models).


- This outcome is in line with **`Occam's razor principle`**: when several models explain the data about equally well, the simpler one is preferred. **Logistic regression is a simple linear model**, and on this dataset it matches or exceeds the other two classifiers on every quality metric.


In [None]:
import pickle
from google.colab import drive
drive.mount('/content/drive')

In [None]:
filename_V = '/content/drive/MyDrive/Big_data/First_lab/Models_weights/Vectoriser.pickle'
pickle.dump(vectoriser, open(filename_V, 'wb'))

In [None]:
filename_LR = '/content/drive/MyDrive/Big_data/First_lab/Models_weights/LR.pickle'
pickle.dump(LRmodel, open(filename_LR, 'wb'))

filename_SVM = '/content/drive/MyDrive/Big_data/First_lab/Models_weights/SVM.pickle'
pickle.dump(SVCmodel, open(filename_SVM, 'wb'))

filename_BNB = '/content/drive/MyDrive/Big_data/First_lab/Models_weights/BNB.pickle'
pickle.dump(BNBmodel, open(filename_BNB, 'wb'))

In [None]:
loaded_model_LR = pickle.load(open(filename_LR, 'rb'))
loaded_model_SVM = pickle.load(open(filename_SVM, 'rb'))
loaded_model_BNB = pickle.load(open(filename_BNB, 'rb'))
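
The save and load cells above follow a simple round trip. Here is a minimal sketch of the same pattern using `with` blocks so that file handles are closed automatically. The dictionary stands in for a fitted model and the temporary path is hypothetical; in the notebook the objects are written to Google Drive instead.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model (hypothetical; any picklable object works).
model_state = {"name": "LR", "C": 2, "max_iter": 1000}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "LR.pickle")  # hypothetical local path
    # Serialize the object to disk.
    with open(path, "wb") as f:
        pickle.dump(model_state, f)
    # Read it back into a new object.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model_state)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.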

In [None]:
filename_X_train = '/content/drive/MyDrive/Big_data/First_lab/data/X_train_vectorised.pickle'
pickle.dump(X_train, open(filename_X_train, 'wb'))

filename_X_test = '/content/drive/MyDrive/Big_data/First_lab/data/X_test_vectorised.pickle'
pickle.dump(X_test, open(filename_X_test, 'wb'))

filename_y_train = '/content/drive/MyDrive/Big_data/First_lab/data/y_train_vectorised.pickle'
pickle.dump(y_train, open(filename_y_train, 'wb'))

filename_y_test = '/content/drive/MyDrive/Big_data/First_lab/data/y_test_vectorised.pickle'
pickle.dump(y_test, open(filename_y_test, 'wb'))

In [None]:
X_train = pickle.load(open(filename_X_train, 'rb'))
X_test = pickle.load(open(filename_X_test, 'rb'))
y_train = pickle.load(open(filename_y_train, 'rb'))
y_test = pickle.load(open(filename_y_test, 'rb'))

In [None]:
BNBmodel = BernoulliNB()
start = time.time()
BNBmodel.fit(X_train, y_train)
end = time.time()
print("The execution time of this model is {:.2f} seconds\n".format(end-start))
model_Evaluate(BNBmodel)
y_pred1 = BNBmodel.predict(X_test)