## Project Description: Twitter US Airline Sentiment
### Data Description:
A sentiment analysis job about the problems of each major U.S. airline. Twitter data was scraped from February of 2015 and contributors were asked to first classify positive, negative, and neutral tweets, followed by categorizing negative reasons (such as "late flight" or "rude service").

#### Dataset:
The dataset comes from Kaggle.
Link to the Kaggle project site: https://www.kaggle.com/crowdflower/twitter-airline-sentiment
Download the dataset (`Tweets.csv`) from the Kaggle page above.

The dataset has the following columns:

        1 tweet_id
        2 airline_sentiment
        3 airline_sentiment_confidence
        4 negativereason
        5 negativereason_confidence
        6 airline
        7 airline_sentiment_gold
        8 name
        9 negativereason_gold
        10 retweet_count
        11 text
        12 tweet_coord
        13 tweet_created
        14 tweet_location
        15 user_timezone

#### Objective:
To implement the techniques learnt as a part of the course.
##### Learning Outcomes:
    1 Basic understanding of text pre-processing.
    2 What to do after text pre-processing:
    3 Bag of words
    4 Tf-idf
    5 Build the classification model.
    6 Evaluate the Model.

#### Steps and tasks:
1. Import the libraries, load dataset, print shape of data, data description. **(5 Marks)**

2. Understanding of data columns: **(5 Marks)**
    a. Drop all other columns except “text” and “airline_sentiment”.
    b. Check the shape of data.
    c. Print first 5 rows of data.

3. Text pre-processing: Data preparation. **(20 Marks)**
    
    a. Html tag removal
    b. Tokenization.
    c. Remove the numbers.
    d. Removal of Special Characters and Punctuations.
    e. Conversion to lowercase.
    f. Lemmatize or stemming.
    g. Join the words in the list to convert back to text string in the dataframe. (So that each row contains the data in text format.)
    h. Print first 5 rows of data after pre-processing.

4. Vectorization: **(10 Marks)**
   
    a. Use CountVectorizer.
    b. Use TfidfVectorizer.
    
5. Fit and evaluate model using both types of vectorization. **(6+6 Marks)**

6. Summarize your understanding of the application of various pre-processing and vectorization techniques and the performance of your model on this dataset. **(8 Marks)**



In [99]:
!pip install spacy
!python -m spacy download en_core_web_sm

[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')




1. Import the libraries, load dataset, print shape of data, data description. **(5 Marks)**

In [100]:
import math
import pandas as pd
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, precision_recall_curve, auc, classification_report
import matplotlib.pyplot as plt

from bs4 import BeautifulSoup
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
import unicodedata
import re
import spacy
from nltk.corpus import stopwords
nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\yoges\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [101]:
data = pd.read_csv('Tweets.csv')

In [102]:
data.shape

(14640, 15)

In [103]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  t

2. Understanding of data columns: **(5 Marks)**
    
    a. Drop all other columns except “text” and “airline_sentiment”. 
    
    b. Check the shape of data. 
    
    c. Print first 5 rows of data.

In [104]:
data = data[['text', 'airline_sentiment']]

In [105]:
data.shape

(14640, 2)

In [106]:
data.describe()

Unnamed: 0,text,airline_sentiment
count,14640,14640
unique,14427,3
top,@united thanks,negative
freq,6,9178


In [107]:
data.head()

Unnamed: 0,text,airline_sentiment
0,@VirginAmerica What @dhepburn said.,neutral
1,@VirginAmerica plus you've added commercials t...,positive
2,@VirginAmerica I didn't today... Must mean I n...,neutral
3,@VirginAmerica it's really aggressive to blast...,negative
4,@VirginAmerica and it's a really big bad thing...,negative


In [108]:
data.isnull().sum()

text                 0
airline_sentiment    0
dtype: int64

In [109]:
data.isna().sum()

text                 0
airline_sentiment    0
dtype: int64

In [110]:
data.text[3]

'@VirginAmerica it\'s really aggressive to blast obnoxious "entertainment" in your guests\' faces &amp; they have little recourse'
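The tweet above still contains the HTML entity `&amp;`, which is why an HTML clean-up step is useful even when a tweet has no visible tags. As a minimal sketch, the standard library's `html.unescape` shows the effect on a fragment of that tweet (the notebook itself relies on BeautifulSoup for this step, so the choice of `html.unescape` here is only for illustration):

```python
import html

# Fragment of the tweet printed above, still carrying an HTML entity
raw = "in your guests' faces &amp; they have little recourse"

# Decode HTML entities such as &amp; back into plain characters
decoded = html.unescape(raw)
print(decoded)
```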

3. Text pre-processing: Data preparation. **(20 Marks)**
    
    a. Html tag removal
    
    b. Tokenization.
    
    c. Remove the numbers.
    
    d. Removal of Special Characters and Punctuations.
    
    e. Conversion to lowercase.
    
    f. Lemmatize or stemming.
    
    g. Join the words in the list to convert back to text string in the dataframe. (So that each row contains the data in text format.)
    
    h. Print first 5 rows of data after pre-processing.

In [111]:
def strip_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    stripped_text = soup.get_text()
    return stripped_text

strip_html_tags(data.text[100])

'@VirginAmerica trying to add my boy Prince to my ressie. SF this Thursday @VirginAmerica from LAX http://t.co/GsB2J3c4gM'

In [112]:
#tokenization
tokenizer=ToktokTokenizer()
tokens=tokenizer.tokenize(data.text[100])
print(tokens)

['@VirginAmerica', 'trying', 'to', 'add', 'my', 'boy', 'Prince', 'to', 'my', 'ressie.', 'SF', 'this', 'Thursday', '@VirginAmerica', 'from', 'LAX', 'http://t.co/GsB2J3c4gM']


In [113]:
#remove accented characters
def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

remove_accented_chars(data.text[100])

'@VirginAmerica trying to add my boy Prince to my ressie. SF this Thursday @VirginAmerica from LAX http://t.co/GsB2J3c4gM'

In [114]:
def remove_special_characters(text, remove_digits=False):
    # Note: the range must be A-Z; A-z would also keep [ \ ] ^ _ and the backtick
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

remove_special_characters(data.text[100], remove_digits=True)

'VirginAmerica trying to add my boy Prince to my ressie SF this Thursday VirginAmerica from LAX httptcoGsBJcgM'
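One pitfall worth checking in character classes like the one used for this step: a range written as `A-z` instead of `A-Z` spans the ASCII characters between `Z` and `a` as well, so `[`, `\`, `]`, `^`, `_` and the backtick survive the clean-up. The sketch below compares the two ranges on a made-up string (the sample text is hypothetical, not taken from the dataset):

```python
import re

# Hypothetical sample containing characters that sit between 'Z' and 'a' in ASCII
sample = "late_flight [again] ^ 2 hrs"

# Range A-z: keeps letters plus [ \ ] ^ _ and the backtick
loose = re.sub(r"[^a-zA-z\s]", "", sample)

# Range A-Z: keeps letters only
strict = re.sub(r"[^a-zA-Z\s]", "", sample)

print(loose.split())
print(strict.split())
```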

In [115]:
# Convert to lower case, split into individual words
words = (data.text[100]).lower().split()   

In [116]:
def lemmatize_text(text):
    text = nlp(text)
    text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])
    return text

lemmatize_text(data.text[100])

'@VirginAmerica try to add my boy Prince to my ressie . SF this Thursday @VirginAmerica from LAX http://t.co/GsB2J3c4gM'

In [117]:
def comment_to_words( raw_comment ):
    # Function to convert a raw string to a string of words
    # The input is a single string (a raw twitter comment), and 
    # the output is a single string (a preprocessed comment)
    #
    # 1. Remove HTML & web links
    #
    new_text = BeautifulSoup(raw_comment, "html.parser").get_text()
    # Match both http and https links; a raw string avoids invalid escape sequences
    comment_text = re.sub(r"https?://\S+", '', new_text)
    #
    # 2. Remove non-letters
    #
    letters_only = re.sub("[^a-zA-Z]", " ", remove_accented_chars(comment_text)) 
    #
    # 3. Remove numbers, special characters and punctuations
    #
    simple_text = remove_special_characters(letters_only, remove_digits=True)
    # 4. Convert to lower case (splitting into words happens in step 7)
    words = simple_text.lower()
    #
    # 5. Lemmatize text
    #
    lemmatized_text = lemmatize_text(words)
    #
    # 6. Create the set of stop words
    #
    stops = set(stopwords.words("english"))                  
    #
    # 7. Tokenize
    #
    tokenizer=ToktokTokenizer()
    words=tokenizer.tokenize(lemmatized_text)
    # 
    # 8. Remove stop words
    #
    meaningful_words = [w for w in words if w not in stops]
    #
    # 9. Join the words back into one string separated by space and return the result.
    #
    return ( " ".join(meaningful_words))

In [118]:
comment_to_words(data.text[100])

'virginamerica try add boy prince ressie sf thursday virginamerica lax'
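As a small sketch of the link-removal part of this pipeline, the pattern below strips both `http` and `https` URLs. The helper name `strip_urls` and the second test sentence are assumptions introduced for illustration; only the first tweet fragment comes from the data shown above.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters
URL_PATTERN = re.compile(r"https?://\S+")

def strip_urls(text):
    """Remove web links and trim whitespace left at the ends."""
    return URL_PATTERN.sub("", text).strip()

tweet = "@VirginAmerica from LAX http://t.co/GsB2J3c4gM"
print(strip_urls(tweet))

# Hypothetical https link with a query string
print(strip_urls("see https://example.com/a?b=1 now").split())
```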

In [119]:
# Apply the pre-processing function to every tweet and collect the
# cleaned strings in a list, in the same order as the dataframe rows
clean_text = [comment_to_words(text) for text in data["text"]]

In [120]:
x_train, x_test, y_train, y_test = train_test_split(clean_text, data['airline_sentiment'], test_size=0.3, random_state=1)

4. Vectorization: **(10 Marks)**
   
    a. Use CountVectorizer.
    
    b. Use TfidfVectorizer.

In [121]:
print ("Creating the bag of words...\n")
from sklearn.feature_extraction.text import CountVectorizer

# Initialize the "CountVectorizer"  
count_vectorizer = CountVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000) 

# fit_transform() fits the model and learns the vocabulary; it also transforms our training data into feature vectors.
train_data_features = count_vectorizer.fit_transform(x_train)

# Convert the result to an array
train_data_features = train_data_features.toarray()

Creating the bag of words...



In [122]:
print (train_data_features.shape)

(10248, 5000)
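Converting the bag-of-words matrix to a dense array has a memory cost that is easy to estimate from the shape above. The arithmetic below is a rough sketch that assumes each count is stored as a 64-bit integer (8 bytes); the TF-IDF run further down keeps the sparse matrix instead, and scikit-learn's random forest accepts sparse input as well.

```python
# Shape of the training feature matrix printed above
n_rows, n_cols = 10248, 5000

# Assumption: each count is stored as a 64-bit integer (8 bytes)
bytes_per_value = 8

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"Dense array size: {dense_bytes / 1e6:.1f} MB")
```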


In [123]:
# Take a look at the words in the vocabulary
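# Note: scikit-learn 1.0 and later provide get_feature_names_out();
# get_feature_names() was removed in version 1.2
vocab = count_vectorizer.get_feature_names()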
print (vocab)
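To make the bag-of-words idea concrete, here is a minimal pure-Python sketch on two made-up documents. It is not how `CountVectorizer` is implemented, and the toy sentences are assumptions; it only shows the kind of document-by-vocabulary count matrix the vectorizer produces.

```python
from collections import Counter

# Two hypothetical, already pre-processed documents
docs = ["flight late late", "thank great flight"]

# Vocabulary: every distinct word, sorted so that columns have a fixed order
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, values are raw counts
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)
for row in matrix:
    print(row)
```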



In [124]:
#
# Use TfidfVectorizer next
#
from sklearn.feature_extraction.text import TfidfVectorizer
tfid_vectorizer = TfidfVectorizer(analyzer = "word",   \
                             tokenizer = None,    \
                             preprocessor = None, \
                             stop_words = None,   \
                             max_features = 5000)
X = tfid_vectorizer.fit_transform(x_train)
print(tfid_vectorizer.get_feature_names())
print(X.shape)

(10248, 5000)
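The following sketch computes TF-IDF weights by hand for two made-up documents. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which to my understanding corresponds to the `TfidfVectorizer` defaults; treat it as an illustration of the weighting rather than a replacement for the library.

```python
import math
from collections import Counter

# Two hypothetical, already pre-processed documents
docs = ["flight late late", "thank great flight"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for doc in tokenized for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word occurs
df = {word: sum(word in doc for doc in tokenized) for word in vocab}

# Smoothed inverse document frequency (assumed formula, see lead-in)
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

def tfidf_row(tokens):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]
for row in rows:
    print([round(v, 3) for v in row])
```

In this toy corpus "flight" appears in both documents, so it gets the lowest idf, while "late" appears in only one and is weighted more heavily in the first row.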


5. Fit and evaluate model using both types of vectorization. **(6+6 Marks)**



In [125]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(verbose=2,n_jobs=-1,n_estimators = 100) 
# Fit the forest to the training set, using the bag of words as features and the sentiment labels as the response variable
#
print ("Training the random forest...")
forest = forest.fit( train_data_features, y_train )
# Random forest performance through cross-validation
print (forest)
print (np.mean(cross_val_score(forest,train_data_features,y_train,cv=10)))

Training the random forest...


[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.




[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    8.6s



[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:   25.5s finished


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=2,
                       warm_start=False)



0.7551708269817073




In [126]:
# Get a bag of words for the test set, and convert to a numpy array
test_data_features = count_vectorizer.transform(x_test)
test_data_features = test_data_features.toarray()

# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)
print (result)


[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:    0.0s


['positive' 'negative' 'positive' ... 'negative' 'positive' 'negative']


[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.3s finished


In [127]:
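# Note: the "Train Accuracy" printed here is the mean 10-fold cross-validation
# accuracy on the training set
print("Train Accuracy score : ", np.mean(cross_val_score(forest,train_data_features,y_train,cv=10)))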
print("Test Accuracy score : ",  accuracy_score(y_test, result))
print("")
print('Confusion Matrix')

# Print the confusion matrix
print(confusion_matrix(y_test, result))
print()
print()
# Print the precision and recall, among other metrics
print(classification_report(y_test, result, digits=3))


Train Accuracy score :  0.7553662347560975
Test Accuracy score :  0.7566029143897997

Confusion Matrix
[[2483  184   74]
 [ 420  428   88]
 [ 189  114  412]]


              precision    recall  f1-score   support

    negative      0.803     0.906     0.851      2741
     neutral      0.590     0.457     0.515       936
    positive      0.718     0.576     0.639       715

    accuracy                          0.757      4392
   macro avg      0.703     0.646     0.669      4392
weighted avg      0.744     0.757     0.745      4392



In [128]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np
# Initialize a Random Forest classifier with 100 trees
forest = RandomForestClassifier(verbose=2,n_jobs=-1,n_estimators = 100) 
# Fit the forest to the training set, using the TF-IDF features and
# the sentiment labels as the response variable
#
print ("Training the random forest...")
forest = forest.fit( X, y_train )
# Random forest performance through cross-validation
print (forest)
print (np.mean(cross_val_score(forest,X,y_train,cv=10)))

Training the random forest...
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.






[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    1.1s



[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    3.1s finished

0.7591733993902439




In [129]:
# Get TF-IDF features for the test set, and convert to a numpy array
test_data_features = tfid_vectorizer.transform(x_test)
test_data_features = test_data_features.toarray()

# Use the random forest to make sentiment label predictions
result = forest.predict(test_data_features)
print (result)


[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:    0.0s


['positive' 'negative' 'negative' ... 'positive' 'positive' 'negative']


[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.3s finished


In [132]:
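# Note: the "Train Accuracy" printed here is the mean 10-fold cross-validation
# accuracy on the training set
print("Train Accuracy score : ", np.mean(cross_val_score(forest,X,y_train,cv=10)))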
print("Test Accuracy score : ",  accuracy_score(y_test, result))
print("")
print('Confusion Matrix')

# Print the confusion matrix
print(confusion_matrix(y_test, result))
print()
print()
# Print the precision and recall, among other metrics
print(classification_report(y_test, result, digits=3))


Train Accuracy score :  0.75507421875
Test Accuracy score :  0.7600182149362478

Confusion Matrix
[[2591  104   46]
 [ 497  365   74]
 [ 254   79  382]]


              precision    recall  f1-score   support

    negative      0.775     0.945     0.852      2741
     neutral      0.666     0.390     0.492       936
    positive      0.761     0.534     0.628       715

    accuracy                          0.760      4392
   macro avg      0.734     0.623     0.657      4392
weighted avg      0.750     0.760     0.739      4392



6. Summarize your understanding of the application of various pre-processing and vectorization techniques and the performance of your model on this dataset. **(8 Marks)**

#### SUMMARY 

Pre-processing the text removes unwanted characters and words that would otherwise increase processing time or confuse the model. The steps used here were removal of HTML tags and web links, accented (non-ASCII) characters, numbers, special characters and stop words, followed by lowercasing and lemmatization. Numbers carry little meaning for sentiment analysis, and stop words can usually be removed without altering the meaning of the text. Lemmatization is more compute-intensive than simple stemming but is generally more accurate, which helps preserve the original meaning of the text.

Vectorization converts the cleaned text into numeric feature vectors that a model can be trained on. Count vectorization (bag of words) simply counts the number of times each word appears in a document, whereas TF-IDF also reflects how important a word is to a document relative to the rest of the corpus. `max_features` was set to 5000 for both vectorizers. In this project, TF-IDF and count vectorization produced similar test accuracy: **0.760 vs 0.757** respectively.

Macro-averaged precision was slightly higher for TF-IDF vectorization **(0.734)** than for count vectorization **(0.703)**, whereas macro-averaged recall and F1-score were slightly lower for TF-IDF **(0.623, 0.657)** than for count vectorization **(0.646, 0.669)**.
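The summary figures can be re-derived directly from a confusion matrix. The sketch below uses the matrix reported for the count-vectorized model (rows are true labels, columns are predicted labels, in the order negative, neutral, positive) and recomputes accuracy and the macro-averaged precision and recall in plain Python:

```python
labels = ["negative", "neutral", "positive"]

# Confusion matrix reported for the count-vectorized model
# (rows: true class, columns: predicted class)
cm = [
    [2483, 184, 74],
    [420, 428, 88],
    [189, 114, 412],
]
n = len(labels)

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total

# Precision: correct predictions of a class / everything predicted as that class
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]
# Recall: correct predictions of a class / all true members of that class
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]

# Macro average: unweighted mean over the three classes
macro_precision = sum(precision) / n
macro_recall = sum(recall) / n

print(f"accuracy        {accuracy:.3f}")
print(f"macro precision {macro_precision:.3f}")
print(f"macro recall    {macro_recall:.3f}")
```

Because the macro average weights all three classes equally, the weaker neutral and positive classes pull it below the overall accuracy, which is dominated by the much larger negative class.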

