# Mini-Project 1 Notebook

The purpose of this mini-project is to implement a regressor that estimates, with a reasonable mean squared error, the popularity of a Reddit comment given its controversiality, whether or not it is a root comment, and the number of children it has. Beyond these base features, we also implement text features and extend them with features of our own.

For this notebook to work, make sure the file "proj1_data_loading.py" is in your current directory and that the other modules imported below are installed. This notebook outlines how our code works without going into the nitty-gritty details.

In [3]:
import proj1_data_loading as p1
import json 
import collections
import numpy as np
import argparse
import string
import nltk
from nltk.corpus import stopwords

redcoms = p1.RedditComments()
redcoms.load()

redcoms is an instance of our own hand-coded RedditComments class. The class contains many methods for manipulating, preprocessing, and running regression on our dataset. It also has one attribute, data, which holds all the data from the JSON file.

redcoms.load() simply sets the data attribute to the contents of the JSON file.

Below, we set a few preprocessing parameters and hyperparameters for our model and data. The hyperparameters are the usual ones for gradient descent, where n0 specifies the starting value of the learning rate alpha.
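To make the role of these hyperparameters concrete, here is a minimal sketch of least-squares gradient descent with a decaying step size. It is not the code in proj1_data_loading.py: the function name `gradient_descent`, the decay schedule alpha = n0 / (1 + beta * k), and the stopping rule based on epsilon are all assumptions made for illustration, and it uses the full batch rather than stochastic updates.

```python
import numpy as np

def gradient_descent(X, y, n0=1e-5, beta=1e-4, epsilon=1e-6, maxiters=10):
    """Hypothetical sketch of least-squares gradient descent.

    X is (n, d), y is (n, 1). The decay schedule and stopping rule
    are assumptions, not taken from the project code.
    """
    w = np.zeros((X.shape[1], 1))
    for k in range(maxiters):
        alpha = n0 / (1.0 + beta * k)        # assumed step-size decay
        grad = 2.0 * X.T @ (X @ w - y)       # gradient of the sum of squared errors
        w_new = w - alpha * grad
        converged = np.linalg.norm(w_new - w) < epsilon
        w = w_new
        if converged:                        # assumed stopping rule
            break
    return w

# Toy data following y = 1 + 2x; the first column of X is the bias term.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([[1.0], [3.0], [5.0], [7.0]])
w = gradient_descent(X, y, n0=0.02, beta=0.0, epsilon=1e-12, maxiters=5000)
```

On the toy data the step size and iteration count are chosen much larger than the notebook's defaults so that the sketch converges; suitable values depend on the scale of the features.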

Here is a description of what each parameter does:

## Parameter Descriptions 
#### closed_form :
If set to True, our model uses the closed-form approach to linear regression. The closed-form solution always finds the weights with the lowest training MSE, but it can be costly to compute for very large matrices.
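As a rough illustration of the closed-form approach, the sketch below solves the normal equations (X^T X) w = X^T y with NumPy. The helper name `closed_form_weights` is hypothetical, and the project's own implementation may differ (for example, it might compute an explicit inverse).

```python
import numpy as np

def closed_form_weights(X, y):
    """Hypothetical helper: solve the normal equations (X^T X) w = X^T y."""
    # np.linalg.solve avoids forming the explicit inverse of X^T X.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Toy data following y = 1 + 2x; the first column of X is the bias term.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[1.0], [3.0], [5.0]])
w = closed_form_weights(X, y)
```

The cost is dominated by building and solving a d x d system, where d is the number of columns of X, which is why large text-feature matrices make this approach expensive.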

#### text_features: 
If set to True, preprocesses the data with text features, appending a matrix of size (NUM_INSTANCES x most_common) to the initial training data. This matrix holds word counts: we first find the N most common words across all comments, then go through each comment and count how many times each of those N words appears in it. The count is stored in the text_features matrix at index (i, j), where i is the instance number and j is the index of the word, ranked by how common it is throughout the dataset.
This feature must be used with most_common set to a value greater than 0.
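The counting scheme described above can be sketched as follows. This is an illustration only: the function name `count_features` and the assumption that each comment is already a list of lowercase tokens are ours, not taken from the RedditComments class.

```python
import collections
import numpy as np

def count_features(comments, most_common):
    """Hypothetical sketch. comments is a list of token lists.

    Returns the vocabulary (most common words first) and a matrix of
    shape (len(comments), len(vocabulary)) holding per-comment counts.
    """
    totals = collections.Counter(word for tokens in comments for word in tokens)
    vocab = [word for word, _ in totals.most_common(most_common)]
    index = {word: j for j, word in enumerate(vocab)}
    features = np.zeros((len(comments), len(vocab)))
    for i, tokens in enumerate(comments):
        for word in tokens:
            j = index.get(word)
            if j is not None:          # ignore words outside the vocabulary
                features[i, j] += 1
    return vocab, features

comments = [["the", "cat", "the"], ["the", "dog"], ["cat"]]
vocab, features = count_features(comments, most_common=2)
```

Here "the" occurs three times overall and "cat" twice, so they form the two-word vocabulary, and the first comment gets the row [2, 1].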

#### most_common:
Integer specifying how many of the most common words to look at. Must be used with either text_features or binary_text set to True.

#### swear_words:
If set to True, will go through each comment and check how many swear words it contains. Will append that array to our data.

#### stop_words:
If set to True, excludes stop words such as "the", "a", and "I" from the most common words used for text_features.

#### discrete_swear_words:
Much like swear words, except binarized: If the comment contains any swear word at all, it takes a value of one. If not, 0.

#### comment_length:
If set to True, goes through each comment and measures its length. If the length is greater than 6, the comment gets a 1; otherwise it gets a 0.

#### binary_text:
Much like text_features, but instead of counting how many times each of the most common words appears in a comment, it gives a 1 if the word appears in that comment at all and a 0 if it doesn't.
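One way to picture the binary variant is as a thresholded version of a count matrix. The snippet below is a sketch under that assumption; the `counts` array is made-up example data, and the project code may build the binary matrix directly instead.

```python
import numpy as np

# Example count matrix: rows are comments, columns are the most common words.
counts = np.array([[2, 1], [1, 0], [0, 1]])

# 1 where the word appears at least once in the comment, 0 otherwise.
binary = (counts > 0).astype(int)
```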

#### k_fold_cross_validation:
If set to True, runs k-fold cross-validation. We use this to test our models further and check whether they are overfitting. Must be used with num_folds set to an integer, preferably 10, a common choice.

#### num_folds:
Integer number of folds into which the 11000-comment train/validation array is split. We recommend 10.



In [5]:
# SGD hyperparameters
args = {
    'n0': 1e-5,
    'beta': 1e-4,
    'epsilon': 1e-6,
    'maxiters': 10,
}

# Should we run SGD or closed-form regression?
closed_form = False

# Preprocessing features you want
text_features = False
most_common = 10
swear_words = False
stop_words = False
discrete_swear_words = False
comment_length = False
binary_text = False

# Cross-validation settings
k_fold_cross_validation = True
num_folds = 10


## Preprocessing
Now for actually running the preprocessing.
First, the case without cross-validation: simply call the preprocess_features method of the RedditComments class, passing all the booleans as keyword arguments. How the data is split into training, validation, and test sets is determined by how you slice the data list, as below. The preprocess_features method takes care of all the preprocessing features described above that you have enabled.

Here, we can see what X_val looks like. If all parameters are False, it should have just 4 columns: the first for the bias term and the last three for the three base features (controversiality, whether the comment is a root comment, and number of children).

In [11]:
if not k_fold_cross_validation:
    # Keyword arguments shared by every preprocess_features call
    feature_kwargs = dict(
        text_features=text_features, most_common=most_common,
        stop_words=stop_words, swear_words=swear_words,
        discrete_swear_words=discrete_swear_words,
        comment_length=comment_length, binary_text=binary_text)

    # Preprocess text and inputs into a numeric data matrix:
    # first 10000 comments for training, next 1000 for validation, next 1000 for testing
    text_train, X_train, y_train = redcoms.preprocess_features(redcoms.data[:10000], **feature_kwargs)
    text_val, X_val, y_val = redcoms.preprocess_features(redcoms.data[10000:11000], **feature_kwargs)
    text_test, X_test, y_test = redcoms.preprocess_features(redcoms.data[11000:12000], **feature_kwargs)

    print(X_val)

## Regression
Now for actually running regression on the preprocessed data: simply call the regress method of the RedditComments class. Its inputs are the boolean selecting a closed-form or SGD solution, the training data, and the text_train and args parameters, which are used if it runs stochastic gradient descent. It returns the training weights, the training MSE, and the training time.

Once you have the weights, you can pass them to the get_mse method of our class as shown below. It takes the X and y data on which you want to evaluate the weights, along with the weights themselves. It returns the mean squared error and a squared-error array, which holds the squared error between each entry of y and the model's prediction for it.
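The quantity that get_mse reports can be sketched as below. The function name `mean_squared_error` is hypothetical and the real method may differ in its return shapes; the sketch only shows the computation described above.

```python
import numpy as np

def mean_squared_error(X, y, weights):
    """Hypothetical sketch: return (MSE, per-row squared errors)."""
    predictions = X @ weights                 # model output for each row of X
    squared_errors = (y - predictions) ** 2   # one squared error per instance
    return squared_errors.mean(), squared_errors

X = np.array([[1.0, 0.0], [1.0, 1.0]])
y = np.array([[1.0], [2.0]])
weights = np.array([[1.0], [2.0]])
mse, squared_errors = mean_squared_error(X, y, weights)
```

In this toy case the predictions are 1 and 3, so the squared errors are 0 and 1 and their mean is 0.5.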

In [7]:
if not k_fold_cross_validation:
    training_weights, training_error, training_time = redcoms.regress(closed_form, X_train, y_train, text_train, args)
    validation_error, squared_error_array = redcoms.get_mse(X_val, y_val, training_weights)

    # Label the output with the solver that was used
    method = "Closed form" if closed_form else "SGD"
    print(method + " training time:   " + str(training_time))
    print(method + " training mean squared error: " + str(training_error[0]))
    print(method + " validation mean squared error:    " + str(validation_error))

SGD training time:   0.3548698425292969
SGD training mean squared error: 1.1306808500346766
SGD validation mean squared error:    [1.07344077]


## Cross Validation

Notice that the last bit of code was inside an if statement on the k_fold_cross_validation boolean. If it is set to True, the code above is skipped and execution continues here.

Preprocessing the data is a little different here: you choose how many instances go into the test set by slicing the data accordingly, and the rest is split according to the k-fold algorithm. We still use the preprocess_features method, just as above.

Furthermore, to test our model and implementation on different data, we chose the test set to be the first 1000 instances, not the last 1000 as above.


In [9]:
if k_fold_cross_validation:
    # Keyword arguments shared by both preprocess_features calls
    feature_kwargs = dict(
        text_features=text_features, most_common=most_common,
        stop_words=stop_words, swear_words=swear_words,
        discrete_swear_words=discrete_swear_words,
        comment_length=comment_length, binary_text=binary_text)

    # First 1000 comments for testing, remaining 11000 for the train/validation split
    text, X, y = redcoms.preprocess_features(redcoms.data[1000:], **feature_kwargs)
    text_test, X_test, y_test = redcoms.preprocess_features(redcoms.data[:1000], **feature_kwargs)

    print(X)

    foldnum = num_folds
    # Integer number of rows per fold
    seg_length = len(X) // num_folds

AttributeError: 'list' object has no attribute 'lower'

Now for the regression step, where it gets a little complicated. We first select the validation data. Once we have that, we set our training data to everything that's left. We then pass that training data to the regress_crossv() method of our RedditComments class, along with the initial weights and the other inputs from the original regress() method.

Once the regression is complete, we use those weights to check how the model performs on the validation data. Finally, we append both the training and validation MSEs to an array.

Once that iteration is done, we start another. We slide the validation window over and set the training data to the rest. To regress_crossv we pass the training weights found in the previous iteration. We append our results and keep repeating until the validation window has reached the end of the available data.
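The sliding validation window can be sketched in isolation as follows. This is a minimal illustration with a hypothetical helper, `k_fold_slices`, that only produces row indices; it assumes contiguous, equally sized folds, so if the number of rows is not divisible by the number of folds the leftover rows at the end always stay in the training set.

```python
import numpy as np

def k_fold_slices(num_rows, num_folds):
    """Hypothetical helper: yield (train_idx, val_idx) for contiguous folds."""
    seg_length = num_rows // num_folds
    for i in range(num_folds):
        start, stop = i * seg_length, (i + 1) * seg_length
        val_idx = np.arange(start, stop)              # current validation window
        train_idx = np.concatenate(
            [np.arange(0, start), np.arange(stop, num_rows)])  # everything else
        yield train_idx, val_idx

folds = list(k_fold_slices(num_rows=10, num_folds=5))
```

Each pair can then be used to index the data matrix, for example `X[train_idx]` and `X[val_idx]` for a NumPy array `X`.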

In [8]:
if k_fold_cross_validation:
    seg_length = int(seg_length)
    # Starting weights
    training_weights = np.zeros([X.shape[1], 1])
    mean_train_mse_arr = np.array([])
    mean_valid_mse_arr = np.array([])
    # Iterate over every fold so each segment serves as validation data once
    for i in range(foldnum):
        # Rows [start, stop) form the validation fold; the rest is training data
        start, stop = i * seg_length, (i + 1) * seg_length
        X_val = X[start:stop]
        y_val = y[start:stop]
        # np.delete returns a new array, so X and y are left unchanged
        X_train = np.delete(X, np.arange(start, stop), axis=0)
        y_train = np.delete(y, np.arange(start, stop), axis=0)

        # Get training weights, starting from the previous iteration's weights
        training_weights, training_error, training_time = redcoms.regress_crossv(closed_form, X_train, y_train, training_weights, text, args)
        valid, squared_error_array = redcoms.get_mse_crossv(X_val, y_val, training_weights)
        mean_train_mse_arr = np.append(mean_train_mse_arr, training_error)
        mean_valid_mse_arr = np.append(mean_valid_mse_arr, valid)
        print("Fold : " + str(i + 1) + "\nTraining MSE: \t" + str(training_error) + "\nValidation MSE:\t" + str(valid))

    print("\nmean of training: " + str(np.mean(mean_train_mse_arr)))
    print("mean of valid :    " + str(np.mean(mean_valid_mse_arr)))

    # Evaluate the final weights on the held-out test set, not the last validation fold
    test_err, squared_error_array = redcoms.get_mse_crossv(X_test, y_test, training_weights)
    print("testing MSE:  " + str(np.mean(test_err)))

NameError: name 'X' is not defined