# Documentation

Ideas Involved:
1. Removing stop words but keeping negation words, since they matter for sentiment analysis
2. Removing proper nouns (reduces processing time)
3. Using word frequencies computed over the whole dataset to convert reviews into vectors (more detail in Method 1)
4. For each word in the word list, also counting its similar words (which may be infrequent but still carry sentiment) when converting reviews into vectors (more detail in Method 2)

### **Analysing Performance of Different Classifiers**
| Classifier | Method 1 $\\$ (500 features) $\\$ (Without Removing Nouns) | Method 1 $\\$ (500 features) $\\$ (With Removing Nouns) | Method 1 $\\$ (1000 features) $\\$ (With Removing Nouns) | Method 2 $\\$ (1000 features) $\\$ (Without Removing Nouns) | Method 2 $\\$ (1000 features) $\\$ (With Removing Nouns) |
|:-----------|:-------------------:|:-------------------:|:-------:|:-----:|:-----:|
|Logistic Regression|**0.8355** (2.7 sec)|**0.8398** (2.3 sec) |**0.84288** (5.2 sec)| **0.8576** | **0.8628**|
|Naive Bayes (LDA)|**0.8296** (5.4 sec)|**0.8336** (3.7 sec)|**0.83488** (8 sec) | **0.85296** | **0.8588** |
|KNN (k=47)|**0.7206** (15 sec) |**0.7184** (7 sec)|**0.71088** (8.3 sec) | NA | NA|
|SVM|**0.8301** (11 min)|**0.8349** (7 min 13 sec)|**0.8384** (15 min) | NA | NA |
|Random Forest(1000 DT)|**0.7835** (28 min) | NA |NA | NA | NA|

**Note:**
1. **NA** means the accuracy was not calculated because of the high processing time and the availability of a better classifier.
2. For KNN, k=47 was chosen after iterating over multiple k values and picking the one with the highest accuracy.
3. For Random Forest, increasing the number of decision trees (DT) increases the accuracy but also increases the processing time.
4. The maximum accuracy achieved on the validation dataset is **0.8628**.
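
The k sweep described in note 2 can be sketched without any library. This is only an illustration: the toy points, the candidate list, and the helper names `knn_predict` and `best_k` are assumptions made for this example, not the notebook's data or code (the notebook itself relies on scikit-learn's `KNeighborsClassifier`).

```python
from collections import Counter

def knn_predict(train_x, train_y, point, k):
    # Rank training points by squared Euclidean distance to `point`
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )[:k]
    # Majority vote among the labels of the k nearest points
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def best_k(train_x, train_y, valid_x, valid_y, candidates):
    # Evaluate every candidate k on the validation set and keep the best one
    scores = {}
    for k in candidates:
        hits = sum(
            knn_predict(train_x, train_y, x, k) == y
            for x, y in zip(valid_x, valid_y)
        )
        scores[k] = hits / len(valid_y)
    k = max(scores, key=scores.get)  # ties resolve to the first candidate
    return k, scores[k]

# Toy, hypothetical data: two well-separated clusters
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]
valid_x = [(0.5, 0.5), (5.5, 5.5)]
valid_y = [0, 1]

k, acc = best_k(train_x, train_y, valid_x, valid_y, candidates=[1, 3, 5])
print(k, acc)
```

On real data the validation accuracy usually varies with k, so the sweep is worth repeating whenever the feature set changes.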

### Improvement:
1. Removing proper nouns helps increase accuracy
2. Keeping negation words increases the accuracy from **0.8576** to **0.8628** (in Method 2 with nouns removed)

### Kaggle Score
1. **Score: 0.85332**

Here is the reference
![image info](kaggle_score.png)

# Installation

In [None]:
! pip install nltk gensim textblob scikit-learn scikit-image pandas numpy

# Libraries

In [12]:
# Python Inbuilt Libraries
import logging      # Displaying Formatted Log Output
import re           # Regular Expression
import string       # Used to get punctuation (string.punctuation)
import operator     # Number of Occurrences of a Word in a List
import collections  # Generating Frequency of Words in a List
import os           # Location and Finding Files

# Data Processing Libraries
import pandas as pd # Processing Input and Output Files (.tsv)
import numpy as np  # Handling Arrays

# NLP libraries
import nltk.data                                 # Tokenization
from nltk.corpus import stopwords           # Finding StopWords
from nltk.stem import WordNetLemmatizer     # Lemmatization
from textblob import TextBlob               # POS tagging
from gensim.models import word2vec          # Word2Vector

# Classification Model Libraries
from sklearn.model_selection import train_test_split    # Splitting training Dataset
from sklearn.ensemble import RandomForestRegressor      # Random Forest
from sklearn import svm                                 # SVM
from sklearn.neighbors import KNeighborsClassifier      # KNN
from sklearn.naive_bayes import GaussianNB              # Gaussian Distribution (Naive Bayes)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis # LDA,QDA (Naive Bayes)
from sklearn.linear_model import LogisticRegression     # Logistic Regression 



# Load Data

In [13]:
# Training Data (Labeled)
labelled_train = pd.read_csv( "../Dataset/labeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )
# Training Data (Unlabeled)
unlabeled_train = pd.read_csv( "../Dataset/unlabeledTrainData.tsv", header=0, delimiter="\t", quoting=3 )
# Test Data
test = pd.read_csv( "../Dataset/testData.tsv", header=0, delimiter="\t", quoting=3 )


# Clean Reviews
Cleaning involves:
1. Removing HTML tags
2. Removing non-alphabetic symbols
3. Lemmatization, removing proper nouns, removing stop words **(keeping negation stop words)**, and removing punctuation and empty strings
4. Storing the cleaned words
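
As a rough illustration of steps 1 to 3, the sketch below cleans a single review with the standard library only. The `STOP_WORDS` set and the `simple_clean` helper are hypothetical stand-ins invented for this example; the notebook uses NLTK's English stop words and also lemmatizes and POS-tags, which this sketch leaves out.

```python
import re

# Hypothetical, tiny stop-word list for illustration only;
# the notebook uses NLTK's English stop-word list instead.
STOP_WORDS = {"the", "was", "a", "and", "it", "is", "not", "never", "i"}
NEGATIONS = {"not", "nor", "none", "never", "neither"}

def simple_clean(review):
    text = re.sub(r"<.*?>", "", review)      # 1. strip HTML tags
    text = re.sub(r"[^a-zA-Z]", " ", text)   # 2. replace non-letters with spaces
    stops = STOP_WORDS - NEGATIONS           # 3. drop stop words, keep negations
    return [w.lower() for w in text.split() if w.lower() not in stops]

print(simple_clean("The movie was <br />not good, and I never liked it!"))
# ['movie', 'not', 'good', 'never', 'liked']
```

Without the set difference, "not" and "never" would be dropped and the review would reduce to `['movie', 'good', 'liked']`, which reads as positive.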

In [14]:

# Initialize Lemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
CLEANR = re.compile('<.*?>') 
# Negation Words (to be removed from stop words)
negation_words = ["not","nor","none","never","neither"]

def clean_review(revs_unclean,removeStopWords = False,removeNoun = False, display = False,needFlattenWords = False):
    '''
    @param:
        revs_unclean:       list of strings (list of uncleaned reviews string)
        removeStopWords:    bool (marker for removing stop words)
        removeNoun:         bool (marker for removing Proper Nouns)
        display:            bool (marker for displaying info)
        needFlattenWords:   bool (marker for flattening the list of list of words)

    @output:
        rev_clean: list of list of string (list of list containing cleaned words of the reviews)
        all_words: flattened rev_clean (used to get word frequency)
    '''

    # Stop words excluding negation words
    stops = set(stopwords.words("english")) - set(negation_words)

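    # Checker for nouns to remove: matches every 'NN*' tag longer than 'NN',
    # i.e. NNP/NNPS (proper nouns) and also NNS (plural common nouns)
    is_noun = lambda pos: pos[:2] == 'NN' and len(pos) >2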

    rev_clean =[] # Initialize empty clean review array

    all_words = []

    if display:
        print("Cleaning Started")
        
    cleaned = 0
    for review in revs_unclean:
        
        # if review is empty
        if not (len(review) > 0):
            print("\nEmpty review found")
            continue
        
        #otherwise

        # 1. Remove HTML Tags
        review_text = re.sub(CLEANR, '', review)

        # 2. Remove Non-alphabetic Symbols
        review_text = re.sub("[^a-zA-Z]"," ",review_text)

        if removeNoun:
            # POS Tagging
            blob = TextBlob(review_text)
        
            # 3. Lemmatization, Removing Proper Nouns, Remove Stop Words (Keeping Negation Stop Words), Remove Punctuations and Empty String
            words = [
                wordnet_lemmatizer.lemmatize(w.lower())                     # Storing Lemmatized word
                for w,pos in blob.pos_tags 
                    if (not is_noun(pos))                                   # Removing Nouns
                    and             
                    (not removeStopWords or w.lower() not in stops)         # Removing Stop Words
                    and    
                    w not in string.punctuation                             # Removing Punctuations
                    and
                    len(w) > 0                                              # Non Empty
                ]
        else:

            # 3. Lemmatization, Remove Stop Words (Keeping Negation Stop Words), Remove Punctuations and Empty String
            words = [
                wordnet_lemmatizer.lemmatize(w.lower())  
                for w in review_text.split()
                if (not removeStopWords or w.lower() not in stops)         # Removing Stop Words
                    and    
                    w not in string.punctuation                            # Removing Punctuations
                    and
                    len(w) > 0                                             # Non Empty
            ]
        
        # 4. Storing 
        rev_clean.append(words)

        # Flattening the words
        if needFlattenWords:
            all_words += words

        # Display
        if display:
            cleaned+= 1
            print("\f\rCleaned: {}/{}".format(cleaned,len(revs_unclean)),end="")
    if display:
        print("\nReviews Cleaned!")
    return rev_clean,all_words



# Finding Most Frequent Words
To create vectors from the reviews (lists of words), we need the most frequent words across all reviews (excluding stop words).
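
A minimal sketch of the ranking, assuming a toy word list: `collections.Counter.most_common()` returns the words from most to least frequent. The slice at the end mirrors how the notebook later skips the first `incr` words; the values used here are illustrative only.

```python
from collections import Counter

# Toy flattened word list (hypothetical data)
all_words = ["good", "bad", "good", "not", "bad", "good"]

# Words ordered from most to least frequent
ranked = [word for word, _ in Counter(all_words).most_common()]
print(ranked)  # ['good', 'bad', 'not']

# Skip the single most frequent word and keep the next two
incr, feature_size = 1, 2
print(ranked[incr:feature_size + incr])  # ['bad', 'not']
```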

In [15]:

def word_frequency_in_descending_order(all_words):
    '''
    @param
        all_words: list of strings (list of  words)
    @output
        frequent_words: list of words sorted by their frequency in descending order (most frequent is at first)
    '''

    # count the word frequency and create dict
    word_freq = dict(collections.Counter(all_words))

    # sort the dict by their values (frequency)
    frequent_words = sorted(word_freq,key=word_freq.get,reverse=True)

    return frequent_words

# Method 1
This method directly converts each review (a list of words) into a vector (a list of numbers) by counting how often each word of a word_list occurs in the review.
For a given review, Vector[i] = number of occurrences of Word_List[i] in that review
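
A small worked example of this counting rule, using toy data. The helper name `count_vector` is hypothetical; it counts the review once with `collections.Counter` rather than calling `operator.countOf` per word, which gives the same numbers.

```python
from collections import Counter

def count_vector(review_words, word_list):
    # Count every word of the review once, then look up each feature word
    counts = Counter(review_words)
    return [counts[w] for w in word_list]  # missing words count as 0

word_list = ["good", "bad", "not"]
review = ["not", "good", "not", "great"]
print(count_vector(review, word_list))  # [1, 0, 2]
```

Words outside the word_list ("great" here) do not contribute to the vector at all, which is the gap Method 2 tries to close.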

## Generate Vector from review 

In [16]:

def generate_vector_direct(revs_clean,words):
    '''
    @param:
        revs_clean: list of list of string (list of cleaned reviews (list of words) )
        words: list of words (word list vector)
    @output:
        list of list of numbers (list of vectors representing reviews)
    '''
    print("Generating Vectors")
    return [
            [
                operator.countOf(review,word) # counting occurrence of word in the review
                for word in words
            ]
            for review in revs_clean
        ]
 

# Method 2
This method builds a word_matrix instead of a word_list vector. The first column of the word_matrix holds the most frequent words (the same words as in the word_list vector), and the remaining columns of each row hold the words most similar in meaning to that row's first word.
```
word_matrix: [
    most freq word 1 (W1) , words with similar meaning to W1 ...
    most freq word 2 (W2) , words with similar meaning to W2 ...
    most freq word 3 (W3) , words with similar meaning to W3 ...
    most freq word 4 (W4) , words with similar meaning to W4 ...
    .
    .
    .
]
```

Now, for a given review, Vector[i] = $\sum_{w \in word\_matrix[i]}$ (number of occurrences of $w$ in that review)
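
The sum can be illustrated with a toy word matrix. The rows below are invented for this example (in the notebook they come from word2vec), and `grouped_count_vector` is a hypothetical helper name.

```python
from collections import Counter

def grouped_count_vector(review_words, word_matrix):
    # Each feature is the total count of all words in one row of the matrix
    counts = Counter(review_words)
    return [sum(counts[w] for w in row) for row in word_matrix]

# Hypothetical rows: a frequent word followed by its similar words
word_matrix = [
    ["good", "great", "nice"],
    ["bad", "awful", "terrible"],
]
review = ["great", "plot", "awful", "acting", "nice"]
print(grouped_count_vector(review, word_matrix))  # [2, 1]
```

With Method 1 and the word list `["good", "bad"]`, the same review would map to `[0, 0]`, so the similar-word columns recover sentiment carried by less frequent words.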

## Generate Dataset for the word2vec
To find similar words, we use word2vec to learn the relationships between words. The reviews from both the labelled and the unlabelled datasets are used to build the training data for word2vec.
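
Once trained, word2vec represents each word as a vector, and gensim's `most_similar` ranks neighbours by cosine similarity between those vectors. The sketch below shows that ranking on hand-made 2-dimensional vectors; the vectors and the helper functions are assumptions for illustration, not output of the trained model.

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(word, vectors, topn=2):
    # Score every other word against `word` and keep the top matches
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy embeddings
vectors = {
    "good": [0.9, 0.1],
    "great": [0.8, 0.2],
    "bad": [-0.9, 0.1],
}
print(most_similar("good", vectors, topn=1))
```

Here "great" ranks first for "good" because the two vectors point in nearly the same direction, while "bad" points the opposite way.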

In [17]:
# generate list of list of words from review
def get_sentences(review,tokenizer):

    '''
    @param:
        review: string (uncleaned review)
        tokenizer: tokenizer 
    @output:
        list of list of processed words
    '''

    # Tokenization
    raw_sentences = tokenizer.tokenize(review.strip())

    # returning cleaned reviews
    return clean_review(raw_sentences)[0]

def prepare_dataset():
    
    '''
    @param:
        None
    @output:
        sentences: list of list of strings (list of list of processed words)
    '''
    # Initialize tokenizer
    tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    sentences = []  

    # Parsing labeled_train data
    print( "Parsing sentences from training set")
    for review in labelled_train["review"]:
        sentences += get_sentences(review, tokenizer)

    # Parsing unlabeled_train data
    print( "Parsing sentences from unlabeled set")
    for review in unlabeled_train["review"]:
        sentences += get_sentences(review, tokenizer)
    
    return sentences


## Train Word2Vec Model

In [18]:

def train_model(sentences,num_features,min_word_count,num_workers,context,downsampling,epoch = 5,PrintLogInfo = True):
    '''
    @param:
        sentences:  list of list of words (dataset)
        num_features:       Word vector dimensionality  
        min_word_count:     Minimum word count
        num_workers:        Number of threads to run in parallel
        context:            Context window size 
        downsampling:       Downsample setting for frequent words
        epoch:              Number of Epochs
        PrintLogInfo:       Printing Log INFO

    @output:
        model: trained model over the dataset
    '''

    # Print log info
    if PrintLogInfo:
        logging.basicConfig(format='\f\r%(asctime)s : %(levelname)s : %(message)s',level=logging.INFO)

    # model name (to save model)
    MODEL_NAME = "word2vec_model(500-40-25)"

    # if a saved model exists, load it from the same path it is saved to
    if os.path.isfile(MODEL_NAME):
        model = word2vec.Word2Vec.load(MODEL_NAME)
    else:
        # otherwise train
        print ("Training model...")
        model = word2vec.Word2Vec(sentences, workers=num_workers, \
                    vector_size=num_features, min_count = min_word_count, \
                    window = context, sample = downsampling,epochs=epoch)

        # init_sims(replace=True) normalizes the vectors in place to save memory.
        # It is deprecated in gensim 4.x, where it is no longer needed for this.
        model.init_sims(replace=True)

        # save the model
        model.save(MODEL_NAME)
    return model

## Generate vectors from reviews 

In [19]:
# find sum of occurrence of each word of wordlist in sen 
def occurrence_of_list(sen,wordlist):
    '''
    @param:
        sen: list of words (a review)
        wordlist: list of words (row of a word_matrix)
    '''
    count = 0
    for word in wordlist:
        count += operator.countOf(sen,word)
    return count

# generate word matrix
def generate_word_list(model,words,e):
    '''
    @param:
        model: model to determine similar words
        words: list of strings (word_list vector)
        e: number of columns (max number of similar words to be found)
    '''

    wordmatrix = [] # empty word matrix
    for word in words:
        templist = [word] # add the word
        try:
            # add the top e most similar words (most_similar returns (word, score) pairs)
            newlist = [similar for similar, _ in model.wv.most_similar(word,topn=e)]
        except KeyError:
            print("Not Found: ",word)
            newlist = [] # if word is not in the vocabulary of the model
        # add row
        wordmatrix.append(templist + newlist)
    return wordmatrix
# generate vectors
def generate_vector_indirect(rev_clean,word_matrix):
    '''
    @param:
        rev_clean: list of list of words  (cleaned reviews)
        word_matrix: list of list of words (word matrix)

    @output:
        list of list of numbers (list of vectors representing reviews)
    '''
    print("Generating Vectors")
    return [
            [
                occurrence_of_list(review,word_list)  # find sum of occurrence of each word of word_list in review
                for word_list in word_matrix
            ] 
            for review in rev_clean
        ]
        


# Reviews to Vectors

## Cleaning Labeled Data

In [20]:
review_unclean = labelled_train['review'].to_numpy() # uncleaned reviews
sentiments = labelled_train['sentiment'].to_numpy()  # sentiments

# cleaned reviews and flatten words
review_clean,all_words = clean_review(review_unclean,True,True,True,True)

Cleaning Started
Cleaned: 25000/25000

# Producing vectors using respective method

In [26]:
METHOD  = 2 # (Default)
feature_size = 1000     # word vector size
incr = 2                # number of initial most frequent words to skip

# generating word vector of size (feature_size) with increment of (incr)
words = word_frequency_in_descending_order(all_words)[incr:feature_size+incr]

if METHOD == 1:

    # generating vectors
    review_vectors = generate_vector_direct(review_clean,words)
elif METHOD == 2:

    # dataset
    sentences = prepare_dataset()

    #model training
    model = train_model(sentences=sentences,num_features=500,min_word_count=40,
                        num_workers=4,context=25,downsampling=1e-3,epoch=10)
    
    #word matrix
    words_of_word = generate_word_list(model,words,10)

    # generating vectors
    review_vectors = generate_vector_indirect(review_clean,words_of_word)
print("Vector Generation Completed")

Generating Vectors


# Classification Models

## Random Forest

In [None]:
def random_forest_classifier(train_features,train_labels,test_features,test_labels,n_estimators = 1000, seed = 42):
    
    # Train the model on training data
    rf = RandomForestRegressor(n_estimators = n_estimators, random_state = seed)
    rf.fit(train_features, train_labels)

    # prediction on validation dataset
    predictions = rf.predict(test_features)
    predictions = [1 if p >= 0.5 else 0 for p in predictions]

    # Accuracy
    predictions = np.array(predictions)
    test_labels = np.array(test_labels)
    accuracy = 1 - np.count_nonzero(test_labels-predictions)/len(test_labels)

    return rf,accuracy

## Support Vector Machine

In [None]:
def svm_classifier(train_features,train_labels,test_features,test_labels,kernel="linear"):
    
    # Train the model on training data
    svm_clf = svm.SVC(kernel=kernel)
    svm_clf.fit(train_features,train_labels)

    # prediction on validation dataset
    predictions = svm_clf.predict(test_features)

    # Accuracy
    test_labels = np.array(test_labels)
    accuracy = 1 - np.count_nonzero(test_labels-predictions)/len(test_labels)

    return svm_clf,accuracy

## K Nearest Neighbours

In [None]:
def KNN_classifier(train_features,train_labels,test_features,test_labels,k=47):

    # Train the model on training dataset
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(train_features, train_labels)

    # prediction on validation dataset
    predictions = knn_clf.predict(test_features)

    # Accuracy
    test_labels = np.array(test_labels)
    accuracy = 1 - np.count_nonzero(test_labels-predictions)/len(test_labels)

    return knn_clf,accuracy

## Naive Bayes

In [None]:
def naive_bayes_classifier(train_features,train_labels,test_features,test_labels,distribution="LDA"):

    # Train the model on training dataset
    if distribution.upper() == "LDA":
        bayes_clf = LinearDiscriminantAnalysis()
    elif distribution.upper() == "GAUSSIAN":
        bayes_clf = GaussianNB()
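    elif distribution.upper() == "QDA":
        bayes_clf = QuadraticDiscriminantAnalysis()
    else:
        # fail early instead of hitting an undefined bayes_clf below
        raise ValueError("distribution must be 'LDA', 'GAUSSIAN' or 'QDA'")
    bayes_clf.fit(train_features,train_labels)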

    # Prediction on validation Dataset
    predictions = bayes_clf.predict(test_features)

    # Accuracy
    test_labels = np.array(test_labels)
    accuracy = 1 - np.count_nonzero(test_labels-predictions)/len(test_labels)

    return bayes_clf,accuracy

## Logistic Regression

In [23]:
def logReg_classifier(train_features,train_labels,test_features,test_labels):

    # Train the model on training dataset
    log_clf = LogisticRegression()
    log_clf.fit(train_features,train_labels)

    # Prediction on validation dataset
    predictions = log_clf.predict(test_features)

    # Accuracy
    test_labels = np.array(test_labels)
    accuracy = 1 - np.count_nonzero(test_labels-predictions)/len(test_labels)

    return log_clf,accuracy

# Split Dataset

In [27]:
# splitting dataset into train and validation datasets
train_features, valid_features, train_labels, valid_labels = train_test_split(review_vectors,sentiments, test_size = 0.25, random_state = 42)
print('Training Features Shape:', np.array(train_features).shape)
print('Training Labels Shape:', np.array(train_labels).shape)
print('Testing Features Shape:', np.array(valid_features).shape)
print('Testing Labels Shape:', np.array(valid_labels).shape)

Training Features Shape: (18750, 1000)
Training Labels Shape: (18750,)
Testing Features Shape: (6250, 1000)
Testing Labels Shape: (6250,)


# Classifier and Their Accuracy

In [28]:
CLASSIFIER = ["LogisticRegression","SVM","NaiveBayes","KNN","RandomForest"]
num = 0 # choose classifier

if CLASSIFIER[num] == "LogisticRegression":
    model,accuracy = logReg_classifier(train_features,train_labels,valid_features,valid_labels)
    print("{} : {}".format(CLASSIFIER[num],accuracy))
elif CLASSIFIER[num] == "SVM":
    model,accuracy = svm_classifier(train_features,train_labels,valid_features,valid_labels)
    print("{} : {}".format(CLASSIFIER[num],accuracy))
elif CLASSIFIER[num] == "NaiveBayes":
    model,accuracy = naive_bayes_classifier(train_features,train_labels,valid_features,valid_labels)
    print("{} : {}".format(CLASSIFIER[num],accuracy))
elif CLASSIFIER[num] == "KNN":
    model,accuracy = KNN_classifier(train_features,train_labels,valid_features,valid_labels)
    print("{} : {}".format(CLASSIFIER[num],accuracy))
elif CLASSIFIER[num] == "RandomForest":
    model,accuracy = random_forest_classifier(train_features,train_labels,valid_features,valid_labels)
    print("{} : {}".format(CLASSIFIER[num],accuracy))


    

LogisticRegression : 0.8544


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


# Prepare TestData 

In [None]:
test_review_unclean = test['review'].to_numpy()

# cleaning test data (clean_review returns a (rev_clean, all_words) tuple; only the reviews are needed)
test_review_clean, _ = clean_review(test_review_unclean,True,True)

# vectorizing 
if METHOD == 1:
    test_review_vector = generate_vector_direct(test_review_clean,words)
elif METHOD == 2:
    test_review_vector = generate_vector_indirect(test_review_clean,words_of_word)

# Save To Submission.csv

In [None]:
ids = test['id'].to_list()
test_prediction = model.predict(test_review_vector) # prediction using classification model

# creating empty dataFrame
test_output_df = pd.DataFrame(columns = ["id","sentiment"])
test_output_df['id'] = ids # IDs
test_output_df['sentiment'] = test_prediction #Prediction

# Saving to submission.csv
test_output_df.to_csv("submission.csv",index=False,sep=",",encoding='utf-8')
