# Supervised Learning
## Machine learning (ML)

The sentiment analysis program we wrote earlier (in Lab 3) adopts a non-machine learning algorithm. That is, it tries to define what words have good and bad sentiments and assumes all the necessary words of good and bad sentiments exist in the word_sentiment.csv file.


Machine Learning (ML) is a class of data-driven algorithms: unlike "normal" algorithms, it is the data that "tells" the algorithm what the "good answer" is. A machine learning algorithm would not have a hard-coded definition of what good and bad sentiment is, but would "learn by example". That is, you show it several sentences that have been labeled as good or bad sentiment, and a good ML algorithm will eventually learn to predict whether an unseen sentence has a good or bad sentiment. This particular example of sentiment analysis is "supervised", which means that your example sentences must be labeled, i.e. you explicitly say which sentences are good and which are bad.
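
To make "learn by example" concrete, here is a minimal sketch of a learner that only counts which words appear under which label. The sentences, the labels and the function names `learn` and `predict` are made up for illustration; this is not the method used later in the lab.

```python
from collections import Counter, defaultdict

# Hypothetical labeled examples (sentence, label); not taken from the course data
labeled = [
    ("what a fantastic day", "good"),
    ("fantastic food and great service", "good"),
    ("this was a terrible movie", "bad"),
    ("terrible traffic and awful weather", "bad"),
]

def learn(examples):
    """Count how often each word occurs under each label."""
    counts = defaultdict(Counter)
    for sentence, label in examples:
        for word in sentence.split():
            counts[word][label] += 1
    return counts

def predict(counts, sentence):
    """Let every known word vote for the labels it was seen with."""
    votes = Counter()
    for word in sentence.split():
        votes.update(counts.get(word, {}))
    # No known word at all: the learner has nothing to go on
    return votes.most_common(1)[0][0] if votes else None

model = learn(labeled)
print(predict(model, "a fantastic great trip"))  # good
```

Nothing in the code says that "fantastic" is a good word; that knowledge comes only from the labeled sentences.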

On the other hand, in the case of unsupervised learning, the sentence examples are not labeled. Of course, in such a case the algorithm itself cannot "invent" what a good sentiment is, but it can try to cluster the data into different groups, e.g. it can figure out that sentences that contain certain words are different from those that contain others (e.g. it might cluster sentences around words like mother, children etc. and find that cluster to be different from another group of sentences that contain words like politician).
There are "intermediate" forms of supervision, i.e. semi-supervised and active learning. Technically, these are supervised methods in which there is some "smart" way to avoid a large number of labeled examples. 

- In active learning, the algorithm itself decides which thing you should label (e.g. it can be pretty sure about a sentence that has the word fantastic, but it might ask you to confirm if the sentence may have a negative like “not”). 
- In one form of semi-supervised learning (co-training), two different algorithms start with the labeled examples and then "tell" each other what they think about a large amount of unlabeled data. From this "discussion" they learn.
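
The active learning idea above can be sketched as picking the example the model is least sure about and asking a human to label that one first. The probabilities below are invented, and `most_uncertain` is a hypothetical helper, not part of any library.

```python
def most_uncertain(scored):
    """Return the sentence whose estimated probability of being positive is closest to 0.5."""
    return min(scored, key=lambda item: abs(item[1] - 0.5))[0]

# Hypothetical model outputs: (sentence, estimated probability of positive sentiment)
scored = [
    ("this is fantastic", 0.95),
    ("this is not fantastic", 0.55),
    ("this is awful", 0.05),
]

print(most_uncertain(scored))  # this is not fantastic
```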


#### Figure 1: Supervised learning approach


<center>
    <img src="ML_SL.png"  width="500" title="Supervised learning">
</center>

Ref 1: 
https://www.youtube.com/watch?v=nKW8Ndu7Mjw&t=382s

Ref 2:
https://www.nltk.org/book/ch06.html

## Sentiment Exercise:

In this exercise we try to improve the *word_sentiment.csv* file. We build a classifier model to predict the sentiment of an unknown word, using the dictionary (corpus) of sentiments available in the word_sentiment.csv file. 

### Training the ML algorithm

#### Module : NLTK (Natural Language Tool Kit)
The NLTK module is built for working with language data. NLTK supports classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities. We will use the NLTK module and employ the Naive Bayes method to classify words as having either positive or negative sentiment. You can also use other modules specifically meant for ML, e.g. the sklearn module.

The nltk module may not be installed on your laptop. If it is not, install it with:

*pip install --user nltk*

To check whether it is installed, just run:

*import nltk*

In [1]:
pip install --user nltk

Note: you may need to restart the kernel to use updated packages.
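
Whether a module can be imported can also be checked without triggering an ImportError. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `is_installed` is our own.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

print(is_installed("csv"))    # part of the standard library
print(is_installed("nltk"))   # depends on your environment
```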


### Step 1: Feature extraction
Define which features of a word you want to use to classify the data set. We will select two features: the first and the last letter of the word.

**Tokenization:** Tokenization is the task of chopping text up into pieces called tokens (e.g. words or letters/characters)
* For words, the letters in the word can act as tokens
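
As a small illustration of the two levels of tokenization mentioned above, using plain Python string operations (the example sentence is made up):

```python
sentence = "Stay strong"

# Word-level tokens: split the sentence on whitespace
word_tokens = sentence.split()
print(word_tokens)   # ['Stay', 'strong']

# Character-level tokens: the letters of a single word
char_tokens = list(word_tokens[1])
print(char_tokens)   # ['s', 't', 'r', 'o', 'n', 'g']
```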

**Feature generation:** selecting the right features and determining how to encode them

* Feature 1: First letter of the word
* Feature 2: Last letter of the word

#### Figure 2: Feature Extractor


<center>
    <img src="FE2.png"  width="500" title="Feature Extraction">
</center>


#### Part 1: Feature Extractor

In [2]:
def feature_extractor(word):
    """ This is the feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature

input_word = input("Enter a word ").lower()
feature = feature_extractor(input_word)
print(feature)

Enter a word sheva
{'first letter': 's', 'last letter': 'a'}


#### Part 2: Create the feature set
We will use the corpus of sentiments from the word_sentiment.csv file to create a feature dataset which we will use to train and test the ML model. 

In [5]:
import csv

def feature_extractor(word):
    """ This is my feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature



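def gen_featureset():
    """ This function creates a feature set for all the words in the word_sentiment.csv file"""
    # NOTE: this path points to the tweet file rather than word_sentiment.csv; in the output
    # below, row[0] is the sentiment label and row[1] is the tweet text
    SENTIMENT_CSV=r"/Users/sheva/Downloads/TweetSentiment (1).csv"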
    with open(SENTIMENT_CSV,'rt', encoding='utf-8') as csvobj:
        ws_data = csv.reader(csvobj)
        featureset = list()
        for row in ws_data:
            w_feature = list()
            feature = feature_extractor(row[0]) #extract the features
            w_feature.append(feature)
            w_feature.append(row[1]) #append the sentiment for each word feature
            featureset.append(w_feature)
            print(w_feature)
        return featureset
            
featureset = gen_featureset()

[{'first letter': 'n', 'last letter': 'e'}, 'On our balcony overlooking the beach...only 3 more nights of this! ']
[{'first letter': 'p', 'last letter': 'e'}, 'I am not fuctioning this morning. Thanks to my work wife @niki_cole for my venti dirty chai with an extra shot. She loves me ']
[{'first letter': 'n', 'last letter': 'e'}, "Can't beat all time low.. (: I soooooo want to go to Metro Station..  Your cheap shots wont be able to break bones"]
[{'first letter': 'p', 'last letter': 'e'}, '@faiththedog thanks for coming out and visiting us in texas. the message was fun and inspirational ']
[{'first letter': 'n', 'last letter': 'e'}, '@ablegamers Keep your chin up, man! Stay strong! ']
[{'first letter': 'n', 'last letter': 'e'}, "In Spanish with @TheSuperSuus  I'm bored"]
[{'first letter': 'n', 'last letter': 'e'}, "Reluctantly going to sleep - don't want to be any closer to waking up to another rainy week ahead "]
[{'first letter': 'p', 'last letter': 'e'}, 'woken up happy  look at the

In [10]:
import csv
import nltk

def feature_extractor(word):
    first_l = word[0]
    last_l = word[-1]
    length = len(word) 
    d_feature = {"first letter" : first_l,"last letter" : last_l,"length": length}
    return d_feature
    
def featureset():
    """creates the data set for training"""
    SENTIMENT_CSV=r"/Users/sheva/Downloads/TweetSentiment (1).csv"
    with open(SENTIMENT_CSV, 'rt', encoding='utf-8') as sobj:
        sdata=csv.reader(sobj)
        featureset=list()
        for row in sdata:
            features=feature_extractor(row[1])
            featureset.append([features,row[0]])
    return featureset

def ML_Train():
    """Train an NB model"""
    fset=featureset()
    train_set=fset[:4000]
    test_set=fset[4000:]
    classifier = nltk.NaiveBayesClassifier.train(train_set) # train the model on the training set
    return classifier # without this return, ML_Train() gives back None

nb_classifier=ML_Train()
i_sentence=input("please enter a sentence: ").lower()
i_features=feature_extractor(i_sentence)
sentiment=nb_classifier.classify(i_features)
print("Sentiment of the sentence: ", sentiment)

### Step 2: Train the model on the training data set
##### Split the sample into training and testing data sets

We will split the feature data set into training and test data sets. The training set is used to train our ML model, and the test set is then used to check how good the model is. It is common to hold out about 20% of the data set for testing. In our case we will retain 2000 words for training and the rest for testing.
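
A fractional split can be sketched as follows. `split_dataset` is a hypothetical helper, and the list of integers merely stands in for a list of `[features, label]` pairs.

```python
def split_dataset(dataset, test_fraction=0.2):
    """Return (train, test), holding out the last test_fraction of the items."""
    test_size = round(len(dataset) * test_fraction)
    cut = len(dataset) - test_size
    return dataset[:cut], dataset[cut:]

data = list(range(10))   # stand-in for a list of [features, label] pairs
train_set, test_set = split_dataset(data)
print(len(train_set), len(test_set))   # 8 2
```

The two slices never overlap, so no example is used for both training and testing.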

##### Use ML method (Naive Bayes) to create the classifier model

The NLTK module gives us several ML methods to create a classifier model using our training set and based on our selected features. 
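
To give a feeling for what a Naive Bayes classifier does with such features, here is a simplified from-scratch sketch. It is not NLTK's implementation: the add-one smoothing, the function names `train_nb` and `classify_nb`, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_featuresets):
    """Count labels, and feature values per label, from (features, label) pairs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)   # (label, feature name) -> value counts
    values_seen = defaultdict(set)        # feature name -> all values seen
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name)][value] += 1
            values_seen[name].add(value)
    return label_counts, value_counts, values_seen

def classify_nb(model, features):
    """Pick the label with the highest log prior plus summed log likelihoods."""
    label_counts, value_counts, values_seen = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)   # log P(label)
        for name, value in features.items():
            seen = value_counts[(label, name)][value]
            # Add-one smoothing, so an unseen value does not give probability zero
            score += math.log((seen + 1) / (count + len(values_seen[name]) + 1))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data in the same shape as our feature set
toy_train = [
    ({"first letter": "g", "last letter": "d"}, "positive"),
    ({"first letter": "g", "last letter": "t"}, "positive"),
    ({"first letter": "b", "last letter": "d"}, "negative"),
    ({"first letter": "a", "last letter": "l"}, "negative"),
]
toy_model = train_nb(toy_train)
print(classify_nb(toy_model, {"first letter": "g", "last letter": "d"}))  # positive
```

The "naive" part is that each feature contributes to the score independently of the others.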

#### Figure 3: Training


<center>
    <img src="Tr2.png"  width="500" title="Training">
</center>

In [5]:
import csv
import nltk # this module has NaiveBayes classifier model

def feature_extractor(word):
    """ This is my feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature

def gen_featureset():
    """ This function creates a feature set for the word_sentiment.csv file"""
    SENTIMENT_CSV = r"C:\Users\pmedappa\Dropbox\Tilburg\Course 2020-2021\DSS\Lab 2\Sentiments\word_sentiment.csv"
    with open(SENTIMENT_CSV,'rt', encoding='utf-8') as csvobj:
        ws_data = csv.reader(csvobj)
        featureset = list()
        for row in ws_data:
            w_feature = list()
            feature = feature_extractor(row[0])
            w_feature.append(feature)
            w_feature.append(row[1])
            featureset.append(w_feature)
        return featureset
            

def ML_train():
    """ This will train the classifier using the word sentiment feature set"""
    featureset = gen_featureset()
    # split the feature set into training and testing sets
    train_set = featureset[:2000]
    test_set = featureset[2000:]
    classifier = nltk.NaiveBayesClassifier.train(train_set) #Train the NaiveBayes model using the training data set
    return classifier

nb_classifier = ML_train()
print(nb_classifier)

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\pmedappa\\Dropbox\\Tilburg\\Course 2020-2021\\DSS\\Lab 2\\Sentiments\\word_sentiment.csv'

### Step 3: Use the trained classifier to predict the sentiment of a given word



#### Figure 4: Prediction


<center>
    <img src="Pr.png"  width="500" title="Prediction">
</center>

In [1]:
import csv
import nltk

def feature_extractor(word):
    """ This is my feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature

def gen_featureset():
    """ This function creates a feature set for the word_sentiment.csv file"""
    SENTIMENT_CSV = r"C:\Users\pmedappa\Dropbox\Tilburg\Course 2020-2021\DSS\Lab 2\Sentiments\word_sentiment.csv"
    with open(SENTIMENT_CSV,'rt', encoding='utf-8') as csvobj:
        ws_data = csv.reader(csvobj)
        featureset = list()
        for row in ws_data:
            w_feature = list()
            feature = feature_extractor(row[0])
            w_feature.append(feature)
            w_feature.append(row[1])
            featureset.append(w_feature)
        return featureset
            

def ML_train():
    """ This will train the classifier using the word sentiment feature set"""
    featureset = gen_featureset()
    train_set = featureset[:2000]
    test_set = featureset[2000:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    return classifier

i_word = input("Enter a word : ").lower()
nb_classifier = ML_train()
i_features = feature_extractor(i_word) # extract the features of the input word
sentiment = nb_classifier.classify(i_features) # use the nb classifier to find the sentiment of the input word
print("Sentiment of the word : ", sentiment)

Enter a word : good
Sentiment of the word :  -2


### Step 4: Evaluating the model

Find how good the model is in identifying the labels. Ensure that the test set is distinct from the training corpus. If we simply re-used the training set as the test set, then a model that simply memorized its input, without learning how to generalize to new examples, would receive misleadingly high scores. The function nltk.classify.accuracy() will calculate the accuracy of a classifier model on a given test set.
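
The point about memorization can be illustrated with a tiny sketch. The `accuracy` helper below is our own simplified stand-in (not the NLTK function), and the word lists are invented.

```python
def accuracy(predict, labeled_examples):
    """Fraction of examples for which predict(example) equals the true label."""
    correct = sum(1 for example, label in labeled_examples if predict(example) == label)
    return correct / len(labeled_examples)

# Hypothetical data: (word, label)
train = [("good", "positive"), ("bad", "negative"), ("great", "positive")]
test = [("awful", "negative"), ("nice", "positive")]

# A "model" that only memorizes the training words and guesses 'positive' otherwise
memory = dict(train)

def memorizer(word):
    return memory.get(word, "positive")

print(accuracy(memorizer, train))  # 1.0
print(accuracy(memorizer, test))   # 0.5
```

Scored on its own training data the memorizer looks perfect; only the separate test set reveals that it has not learned to generalize.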

#### Figure 5: Evaluation


<center>
    <img src="Ev.png"  width="500" title="Evaluation">
</center>

In [None]:
import csv
import nltk
import random

def feature_extractor(word):
    """ This is my feature extractor. Given a word, it returns a dictionary of the first and last letter
    of the word given to it"""
    first_l = word[0]
    last_l = word[-1]
    dict_feature = {"first letter" : first_l,"last letter" : last_l}
    return dict_feature

def gen_featureset():
    """ This function creates a feature set for the word_sentiment.csv file"""
    SENTIMENT_CSV = r"C:\Users\pmedappa\Dropbox\Tilburg\Course 2020-2021\DSS\Lab 2\Sentiments\word_sentiment.csv"
    with open(SENTIMENT_CSV,'rt', encoding='utf-8') as csvobj:
        ws_data = csv.reader(csvobj)
        featureset = list()
        for row in ws_data:
            w_feature = list()
            feature = feature_extractor(row[0])
            w_feature.append(feature)
            if int(row[1]) >= 0: #Club the sentiment value into positive and negative (two labels)
                sentiment = 'positive'
            else: sentiment = 'negative'
            w_feature.append(sentiment)
            featureset.append(w_feature)
        return featureset
            

def ML_train():
    """ This will train the classifier using the word sentiment feature set"""
    featureset = gen_featureset()
    random.shuffle(featureset) # !! IT IS IMPORTANT TO SHUFFLE THE LABELLED DATA SINCE THE WORDS ARE ALPHABETICAL
    train_set = featureset[:2000]
    test_set = featureset[2000:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    acc = nltk.classify.accuracy(classifier,test_set) #Find accuracy of the model using the test set
    print("The accuracy of the model is ", acc)
    classifier.show_most_informative_features() #show the most informative features
    return classifier

i_word = input("Enter a word : ").lower()
nb_classifier = ML_train()
i_features = feature_extractor(i_word)
print("Features extracted from the word : ",i_features)
sentiment = nb_classifier.classify(i_features)
print("Sentiment of the word : ", sentiment)

##### NOTE: Improvements made to increase accuracy
- Shuffle the corpus, so that the alphabetical ordering of the words no longer affects the training/test split
- Reduce the number of outcome labels by clubbing them (i.e. change the sentiment range from -5 to 5 into two labels, 'positive' and 'negative')
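
Both improvements can be tried in isolation on a few made-up rows. The words and scores below are hypothetical, and the fixed seed is there only to make the shuffle repeatable.

```python
import random

def club(score):
    """Map a numeric sentiment score to one of two labels."""
    return "positive" if score >= 0 else "negative"

# Hypothetical (word, score) rows in alphabetical order, as in a sorted word list
rows = [("abandon", -2), ("adore", 3), ("bad", -3), ("calm", 2), ("zealous", 2)]

labeled = [(word, club(score)) for word, score in rows]

rng = random.Random(42)   # fixed seed only to make the shuffle repeatable
shuffled = labeled[:]     # copy, so the original order is kept for comparison
rng.shuffle(shuffled)

print(labeled)
print(shuffled)
```

Without shuffling, a slice such as `[:2000]` of an alphabetical list would train only on words from the start of the alphabet and test only on words from the end.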

### Exercise 1

*Improve the feature extractor (by adding new features) so that the test accuracy goes up a bit. Can you reach 70% accuracy?*

In [None]:
#Enter code here
import csv
import nltk
import random

def feature_extractor(word):
    """ This is my feature extractor. Given a word, it returns a dictionary of its first and last letter,
    its first two and last two letters, and its length"""
    first_l = word[0]
    last_l = word[-1]
    first2_l = word[0:2]
    last2_l = word[-2:]
    l = len(word)
    dict_feature = {"first 2 letters" : first2_l,"last 2 letters" : last2_l,
                    "first letter" : first_l,"last letter" : last_l,
                    "length" : l}
    return dict_feature

def gen_featureset():
    """ This function creates a feature set for the word_sentiment.csv file"""
    SENTIMENT_CSV = "C:/Users/pmedappa/Dropbox/Tilburg/Course 2019-2020/DSS/Lab4/Sentiments/word_sentiment.csv"
    with open(SENTIMENT_CSV,'rt', encoding='utf-8') as csvobj:
        ws_data = csv.reader(csvobj)
        featureset = list()
        for row in ws_data:
            w_feature = list()
            feature = feature_extractor(row[0])
            w_feature.append(feature)
            if int(row[1]) > 0:
                sentiment = "positive"
            else: sentiment = "negative"
            w_feature.append(sentiment)
            featureset.append(w_feature)
        return featureset
            

def ML_train():
    """ This will train the classifier using the word sentiment feature set"""
    featureset = gen_featureset()
    random.shuffle(featureset)
    train_set = featureset[:2000]
    test_set = featureset[2000:]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    acc = nltk.classify.accuracy(classifier,test_set)
    print("The accuracy of the model is ", acc)
    classifier.show_most_informative_features()
    return classifier

i_word = input("Enter a word : ").lower()
nb_classifier = ML_train()
i_features = feature_extractor(i_word)
print("Features extracted from the word : ",i_features)
sentiment = nb_classifier.classify(i_features)
print("Sentiment of the word : ", sentiment)

### Exercise 2

Build a classifier model that can predict the sentiment of a sentence. Please use the TweetSentiment.csv file provided for training and testing. This file has a list of 5000 tweets labeled by their sentiment (positive or negative).

In [None]:
 # discussed in lab 4
    
import csv
import nltk
import random

def feature_extractor(sentence):
    l_sentence = sentence.split()
    f_word = l_sentence[0]
    l_word = l_sentence[-1]
    length = len(l_sentence) 
    
    if length >= 2:
        f2_word = l_sentence[0]+" "+l_sentence[1]
        l2_word = l_sentence[-2]+" "+l_sentence[-1]
    else:
        f2_word = ""
        l2_word = ""
    d_feature = {"first word" : f_word,"last word" : l_word,"length": length, "first two words": f2_word, "last 2 words": l2_word}
    return d_feature
    
def featureset():
    """creates the data set for training"""
    SENTIMENT_CSV=r"/Users/sheva/Downloads/TweetSentiment (1).csv"
    with open(SENTIMENT_CSV, 'rt', encoding='utf-8') as sobj:
        sdata=csv.reader(sobj)
        featureset=list()
        for row in sdata:
            features=feature_extractor(row[1])
            featureset.append([features,row[0]])
    return featureset

def ML_Train():
    """Train an NB model"""
    fset=featureset()
    random.shuffle(fset)
    train_set=fset[:4000]
    test_set=fset[4000:]
    classifier = nltk.NaiveBayesClassifier.train(train_set) #Train the NaiveBayes model using the training data set
    acc = nltk.classify.accuracy(classifier, test_set)
    print("the accuracy of the NB model is ", acc)
    classifier.show_most_informative_features()
    
    dt_classifier = nltk.DecisionTreeClassifier.train(train_set)
    dt_acc = nltk.classify.accuracy(dt_classifier, test_set) # evaluate the decision tree, not the NB model
    print("the accuracy of the DT model is ", dt_acc)
    
    return classifier

    
nb_classifier=ML_Train()
i_sentence=input("please enter a sentence: ").lower()
i_features=feature_extractor(i_sentence)
sentiment=nb_classifier.classify(i_features)
print("Sentiment of the sentence: ", sentiment)

the accuracy of the NB model is  0.589
Most Informative Features
               last word = 'fun'          positi : negati =      7.0 : 1.0
                  length = 29             negati : positi =      5.8 : 1.0
              first word = 'getting'      positi : negati =      5.0 : 1.0
              first word = 'watching'     positi : negati =      5.0 : 1.0
               last word = 'go'           negati : positi =      5.0 : 1.0
               last word = 'thanks'       positi : negati =      5.0 : 1.0
               last word = 'today.'       negati : positi =      5.0 : 1.0
         first two words = 'I miss'       negati : positi =      5.0 : 1.0
         first two words = "I don't"      negati : positi =      4.3 : 1.0
               last word = 'haha'         positi : negati =      4.3 : 1.0

#### Note: Development testing and error analysis

Using a separate dev-test set, we can generate a list of the errors that the classifier makes when predicting the sentiment. We can then examine individual error cases where the model predicted the wrong label, and try to determine what additional pieces of information would allow it to make the right decision (or which existing pieces of information are tricking it into making the wrong decision). The feature set can then be adjusted accordingly.
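
Collecting the errors can be sketched like this. The rule-based `toy_predict` and the three dev-test sentences are hypothetical stand-ins for a trained classifier and a real dev-test set.

```python
def collect_errors(predict, dev_set):
    """Return (true label, predicted label, text) for every wrong prediction."""
    errors = []
    for text, label in dev_set:
        guess = predict(text)
        if guess != label:
            errors.append((label, guess, text))
    return errors

def toy_predict(text):
    """Hypothetical stand-in for a classifier: 'fun' anywhere means positive."""
    return "positive" if "fun" in text.split() else "negative"

dev_set = [
    ("that was fun", "positive"),
    ("not fun at all", "negative"),
    ("I miss you", "negative"),
]

for label, guess, text in collect_errors(toy_predict, dev_set):
    print(f"correct={label:<9} guess={guess:<9} text={text}")
```

Here the single error suggests that a feature capturing negation ("not") might help.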

In [None]:
### assignment

import pandas as pd

new_tweet_csv = r"Trump_Tweets_2019.csv"
index_csv = r"nasdaq"
merge_csv = r"merged_trump_nasdaq"





##2nd part of the assignment
import csv
import random

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords') # the stop-word list has to be downloaded once

def tokenize(sentence):
    """this function does the task of converting a sentence into a list of words"""
    t_words = sentence.split()
    return t_words

def removestopwords(tokens):
    """Return the lower-cased tokens that are not English stop words"""
    stop_words=set(stopwords.words("english"))
    filtered_tokens=list()
    for w in tokens:
        if w.lower() not in stop_words: # compare in lower case, like the stop-word list
            filtered_tokens.append(w.lower())        
    return filtered_tokens


def mostfreq_words(all_tokenwords):
    """Return up to 5000 of the most frequent words"""
    all_words=nltk.FreqDist(all_tokenwords)
    word_features=[word for word, count in all_words.most_common(5000)]
    return word_features
    

merge_csv = r"merged_tweet_NASDAQ_2019"
with open (merge_csv, 'rt', encoding='utf-8') as tweetobj:
    classified_tweet=csv.reader(tweetobj)
    for tweet in classified_tweet:
        tokens = tokenize(tweet[2]) # the tweet text is read from the third column
        print(tokens)

        filtered_tokens = removestopwords(tokens)
        print(filtered_tokens)