# Support Vector Machine
#### Linear Regression using scikit-learn's support vector machine
Stephen Wight - author

### Project Aims

This is one portion of the Yelp Reviews project being undertaken by George Avitesyan, Alex Buckalew, Desmong Henderson, Michael Sriqui, and Stephen Wight.

The aim is to create multiple algorithms using different ML concepts, then merge them in an ensemble-learning process. This notebook represents one ML algorithm - the Support Vector Machine and its Linear Regression implementation.

#### Importing Libraries
##### Scikit-Learn
This project leans heavily on the Scikit-Learn libraries for machine learning. It implements:
* Support Vector Machine
    * Linear Regression
* Feature Extraction
    * Text
        * TF/IDF Vectorizer
* Model Selection
    * Train/Test Split

In [1]:
from sklearn.svm import LinearSVR
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
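
As a rough illustration of what the TF/IDF vectorizer produces, here is a minimal sketch on three made-up sentences (the `docs` list is an assumption for the example, not project data). Each document becomes one row, and each distinct word becomes one column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up reviews; the real project reads these from the Yelp data.
docs = ['great food great service', 'terrible food', 'great place']

vec = TfidfVectorizer()
vectors = vec.fit_transform(docs)

# One row per document, one column per distinct word.
print(vectors.shape)
print(sorted(vec.vocabulary_))
```

The result is a sparse matrix, which is why it can be handed straight to `train_test_split` and the model without being converted to a dense array.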

##### Contractions, json, and string.punctuation
I'm also using a module, 'contractions', to expand any contractions that are in the text.

The data is stored in JavaScript Object Notation (JSON), so to access it, I need to use a library that parses the JSON objects and returns them as Python dictionaries.

Next, I'm importing a constant which includes the most commonly used punctuation, so that I can strip punctuation marks from the text.

Finally, I'm using the pickle library, which allows me to save a Python object as a binary file that I can import later. This is useful for moving my trained model from this demonstration into a prediction algorithm. For this purpose, I only need the "dumps" function.

In [2]:
import contractions
from json import loads
from string import punctuation
from pickle import dumps
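
A minimal sketch of the `dumps` step, assuming a placeholder dictionary in place of a trained model and a hypothetical file name `model.pkl`. Note that `dumps` only returns bytes; writing them to disk is a separate step, and `loads` is needed on the other side to rebuild the object:

```python
from pickle import dumps, loads
import os
import tempfile

# Placeholder object standing in for a trained model.
fake_model = {"name": "LinearSVR", "C": 1.0}

# dumps() serializes the object to bytes; it does not write a file itself.
payload = dumps(fake_model)

# Write the bytes to a binary file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as fl_out:
    fl_out.write(payload)

# Later, read the bytes back and rebuild the object.
with open(path, "rb") as fl_in:
    restored = loads(fl_in.read())

print(restored == fake_model)
```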

### Custom Functions
#### demotify()
To preprocess my data, I needed a few custom functions. The first is "demotify", which replaces common emoticon strings like :) and :( with words like 'emhappy' and 'emsad', so that they don't get stripped away with the punctuation in a later step.

In [3]:
def demotify(indoc:str) -> str:
    """demotify - replace select emoticons in a string with words

    This function replaces select emoticons (the precursor to graphical emoji) in
    a string of text with a word that represents their emotional
    content.

    Args:
        indoc: This is the string that may contain emoticons to be replaced.

    Returns:
        str: This function returns the string with emoticons replaced by words


    """

    ## Order matters: '>:(' contains ':(', so it must be replaced first.
    emoticon_dict = {
        '>:(':' emangry ',
        ':)': ' emhappy ',
        ':-)':' emhappy ',
        ':(': ' emsad ',
        ':-(':' emsad ',
        ':D': ' emgrin ',
        ':-D':' emgrin ',
        ';)': ' emwink ',
        ';-)':' emwink '
    }

    for key in emoticon_dict:
        indoc = indoc.replace(key, emoticon_dict[key])

    return indoc
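
When one emoticon contains another (for example, '>:(' contains ':('), the order of the replacements changes the result. The sketch below shows the effect on a two-entry dictionary; `replace_all` is a hypothetical helper that sorts keys longest-first, which is one way to avoid the clash:

```python
def replace_all(text: str, mapping: dict) -> str:
    """Apply replacements longest-key-first so overlapping keys do not clash."""
    for key in sorted(mapping, key=len, reverse=True):
        text = text.replace(key, mapping[key])
    return text

emoticons = {':(': ' emsad ', '>:(': ' emangry '}

# Naive order: ':(' is replaced first, leaving a stray '>' behind.
naive = 'bad service >:('
for key in emoticons:
    naive = naive.replace(key, emoticons[key])

# Longest-first order: '>:(' is matched before ':(' can consume part of it.
ordered = replace_all('bad service >:(', emoticons)

print(repr(naive))
print(repr(ordered))
```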

#### getsample()

Getsample is a pretty important function. This is where I collect single reviews and their related star ratings and preprocess them. I process them in the following order:
* demotify() - replace emoticons
* convert to lowercase
* contractions.fix() - expand contractions
* str.translate - remove punctuation

In addition, this function is set up to select a varied sample-size from varied starting points in the data.

This function is also where I would normally check for language using a library called langdetect, but Alex has already preprocessed the entire dataset and kept only those reviews which detect as being in English. That section has been commented out for this demonstration.
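
The reason emoticons are replaced before punctuation is stripped can be seen in a short sketch. The review text here is made up, and only the standard library is used (the contractions step is left out):

```python
from string import punctuation

# Translation table that deletes every character in string.punctuation.
strip_table = str.maketrans('', '', punctuation)

raw = "Loved it :) Best tacos in town!!!"

# Without the emoticon step, ':)' is simply deleted with the punctuation.
plain = raw.lower().translate(strip_table)

# Replacing the emoticon with a word first lets the sentiment survive.
kept = raw.replace(':)', ' emhappy ').lower().translate(strip_table)

print(plain)
print(kept)
```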

In [4]:
def getsample(infile, numsamples, startpoint=0):

    ### Declare the variable outside of the loops
    outlist = []
    with open(infile, 'rt') as fl_in:

        ## Iterating over the file handle reads one line at a time instead of loading the entire file into memory; enumerate adds the line number.
        for i,x in enumerate(fl_in):
            ## Keep moving until you get to the start point
            if i < startpoint:
                continue

            ## When the enumerate function reaches the desired start point, let the user know
            if i == startpoint:
                print(f'Beginning at {i}')

            ## It's useful with long processes to have an output that tells the user that things are still working!
            ## This line prints a status every 10000 lines.
            if len(outlist) % 10000 == 0: print(f'\rRetrieved {len(outlist)}th record.', end='')

            ## Parse the JSON line into a dictionary
            x = dict(loads(x))

            ## Replace emoticons with words
            x['text'] = demotify(x['text'])

            ## Check for english language
            # try:
            #     if detect(x['text']) != 'en':
            #         continue
            # except:
            #     continue

            ## convert to lowercase
            x['text'] = x['text'].lower()

            ## expand contractions
            x['text'] = contractions.fix(x['text'])

            ## remove punctuation
            x['text'] = x['text'].translate(str.maketrans('','',punctuation))

            ## Add the current line to the list to be returned
            outlist.append(x)

            ## Check to see if we have enough samples. If we do, break the loop. If not, let it continue.
            if len(outlist) >= numsamples:
                print(f'\n{len(outlist)} samples retrieved.')
                break

    ## Return our completed list.    
    return outlist
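
The line-at-a-time reading with a start point can also be sketched with `itertools.islice`. This is an alternative under stated assumptions, not the function above: `read_window` is a hypothetical helper, the data is an in-memory stand-in for the reviews file, and it counts lines rather than accepted records, so it only matches `getsample` while no records are filtered out:

```python
import io
from itertools import islice
from json import loads

# In-memory stand-in for the reviews file: one JSON object per line.
fake_file = io.StringIO(
    '{"text": "great food", "stars": 5}\n'
    '{"text": "slow service", "stars": 2}\n'
    '{"text": "fine", "stars": 3}\n'
    '{"text": "never again", "stars": 1}\n'
)

def read_window(handle, numsamples, startpoint=0):
    """Yield up to numsamples parsed records, skipping the first startpoint lines."""
    for line in islice(handle, startpoint, startpoint + numsamples):
        yield loads(line)

records = list(read_window(fake_file, numsamples=2, startpoint=1))
print([r["stars"] for r in records])
```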

### The Training Function
This function encapsulates the entire training process. I call this function with an untrained model, a filename, a sample size, and a start location (so it can pass them to the getsample function), and it returns a trained model and an R-squared score measured on the held-out test data.

In [5]:
def training(model:LinearSVR, filein:str, samplesize:int, startloc:int=0, tts=0.8) -> tuple:

    ## get a sample from the dataset 
    data = getsample(filein, samplesize, startloc)

    ## Separate the review text (features) from the star ratings (target)
    text = []
    stars = []
    for record in data:
        text.append(record['text'])
        stars.append(record['stars'])

    ## Initialize the TF/IDF vectorizer
    vec = TfidfVectorizer()

    ## Fit and transform the vectorizer 
    vectors = vec.fit_transform(text)

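    ## Split the data into training and testing sets.
    ## Note: tts is passed as test_size, so the default of 0.8 holds out 80% of the sample for testing.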
    X_train, X_test, y_train, y_test = train_test_split(vectors,stars,test_size=tts)

    model.fit(X_train,y_train)
    r_squared = model.score(X_test, y_test)

    return model, r_squared
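
Because `tts` is forwarded to `test_size`, it is worth seeing what a value of 0.8 does. The sketch below uses 100 placeholder rows (not project data) and a fixed `random_state` so the sizes are easy to check:

```python
from sklearn.model_selection import train_test_split

X = list(range(100))   # placeholder features
y = list(range(100))   # placeholder targets

# test_size=0.8 holds out 80% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.8, random_state=0
)
print(len(X_train), len(X_test))

# train_size=0.8 does the opposite: 80% of the rows are used for training.
X_train2, X_test2, _, _ = train_test_split(
    X, y, train_size=0.8, random_state=0
)
print(len(X_train2), len(X_test2))
```

With the default `tts=0.8`, a 100,000-sample run therefore fits the model on about 20,000 reviews and scores it on the other 80,000.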

## Benchmarks
Now that all of the functions are laid out, let's approach training and tuning our model's hyperparameters.

I want to begin with a very small dataset, just to be sure everything is working.
#### Sample Size - 1000
I'll start with 1000 samples right off the top, and I'll use all of the default settings on the model.

In [6]:
## Everything here is the default settings.
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-4,
    C=1.0,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=0,
    random_state=None,
    max_iter=1000
)

svr_default_1000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 1000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 0th record.
1000 samples retrieved.
This model represents 17.93% of the sentiment in the text.


#### Sample Size 1,000 - result
As one would expect, with only 1,000 samples, the model did not perform very well. A 1,000-sample run is really just to be sure everything's running smoothly.
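
The percentage printed above is the R-squared value returned by `model.score`: the share of the variance in the star ratings that the predictions explain. A minimal sketch of the calculation in plain Python, using made-up ratings (the `r_squared` helper is for illustration only):

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

stars = [1, 2, 3, 4, 5]            # made-up true ratings
perfect = [1, 2, 3, 4, 5]          # predictions that match exactly
always_mean = [3, 3, 3, 3, 3]      # predicting the average every time

print(r_squared(stars, perfect))
print(r_squared(stars, always_mean))
```

A model that always predicts the average rating scores 0, and a perfect model scores 1, so 17.93% means the model is only slightly better than guessing the average.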

#### Sample Size - 10,000
With only 10,000 samples, this is still unlikely to reach production-level standards, but the hope is that the score improves significantly.

In [7]:
## Everything here is the default settings.
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-4,
    C=1.0,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=0,
    random_state=None,
    max_iter=1000
)

svr_default_10000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 10000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 0th record.
10000 samples retrieved.
This model represents 54.14% of the sentiment in the text.


#### Sample Size 10,000 Results
The R-squared score roughly tripled, from 17.93% with 1,000 samples to 54.14% with 10,000.

## BENCHMARK
#### Sample Size 100,000

The benchmark for our team is at the 100,000 sample mark. After running at 100,000 samples, we can begin adjusting the hyperparameters.

In [8]:
## Everything here is the default settings.
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-4,
    C=1.0,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=0,
    random_state=None,
    max_iter=1000
)

svr_default_100000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 100000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 90000th record.
100000 samples retrieved.
This model represents 63.47% of the sentiment in the text.


## BENCHMARK RESULTS
#### 100,000 sample results

With 100,000 samples, the default settings on the LinearSVR model result in a 63.47% R-squared score. Given that this is roughly 3% of the available data, I feel that this is a good start. 

## Hyperparameters

Hyperparameters are the settings one chooses when declaring a Machine Learning model, before training begins. Unlike the model's weights, they are not learned from the data.

### Tangential Parameters
The LinearSVR model has ten hyperparameters, but only some are worth adjusting here. Here are a few that should remain as they are:

#### fit_intercept (boolean) - default True
*fit_intercept* tells the algorithm whether to calculate an intercept. Setting it to **False** assumes the data has already been centered. I have made no effort to center the data, so this needs to stay **True**.
#### verbose (integer) - default 0
This variable does not affect the performance of the model - it only determines whether the model prints status updates as it works. While those might be interesting, they won't affect performance. When I get closer to production-level, I may turn this value on so that I can see a little more about what's going on inside.
#### random_state (integer) - default None
The only reason to set this to a specific number is to force the 'random' variables to be exactly the same every time. This can be useful when you're comparing runs. For later tests, I will set *random_state* to a specific value so I can isolate other changes.

### Operational Parameters
Of the remaining seven, a few can be adjusted independently of the others. They include:
#### tol - float, default 0.0001
*tol* represents tolerance for stopping. This represents how close to the best possible fit we will require the algorithm to get before it stops. Making this number smaller should increase accuracy, but it will also increase processing time.
#### C - float, default 1
*C* is the regularization parameter. It sets how heavily errors on the training data are penalized relative to keeping the weights small, and the strength of the regularization is inversely proportional to C. Increasing this value fits the training data more closely, but it risks overfitting. If overfitting is already a problem, lowering C might combat it. This will become very useful as our dataset size increases.
#### dual - boolean, default **True**.
*dual* selects whether the algorithm solves the 'dual' or 'primal' optimization problem. These are two formulations of the same problem: the dual works with one variable per sample, while the primal works with one variable per feature. The recommendation is that *dual* be set to **False** when there are more samples than there are features. I believe there are around 100,000 words represented in the text, so with 100,000 samples, setting this to **True** is reasonable. However, as I increase the sample size, setting this to **False** could be useful. LinearSVR only supports **False** together with the 'squared_epsilon_insensitive' loss, so the two settings have to be changed together. Of note, when set to **False**, the algorithm does not use any random values, so random_state is ignored.
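
A minimal sketch of switching to the primal problem, using synthetic data from `make_regression` rather than the review data. In scikit-learn's LinearSVR, `dual=False` is only accepted with `loss='squared_epsilon_insensitive'`; the second call is there to show that the default loss is rejected:

```python
from sklearn.datasets import make_regression
from sklearn.svm import LinearSVR

# Synthetic data with many more samples than features.
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# Primal problem: requires the squared epsilon-insensitive loss.
primal = LinearSVR(loss='squared_epsilon_insensitive', dual=False)
primal.fit(X, y)
print(primal.coef_.shape)

# The default loss cannot be combined with dual=False.
raised = False
try:
    LinearSVR(loss='epsilon_insensitive', dual=False).fit(X, y)
except ValueError as err:
    raised = True
    print('Unsupported combination:', err)
```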
#### max_iter - integer, default 1000
*max_iter* is simply the maximum number of iterations the solver may run before it takes what it has as the best it can do. Increasing this value might increase accuracy, but it will also increase processing time.

Two of the parameters are linked: **epsilon** and **loss**
#### loss - string, default 'epsilon_insensitive'
This specifies which loss function to use: 'epsilon_insensitive' (L1) or 'squared_epsilon_insensitive' (L2).

A support vector machine depends on a margin. As we attempt to minimize the cost function, points that fall outside that margin add a penalty to the cost function, and that information informs later iterations of the algorithm. In the formulas below, $\xi_i$ is the distance by which sample $i$ falls outside the margin (zero for samples inside it).

L1 is as follows:

$\frac{1}{2}||w||^2 + C\sum_{i=1}^{M}\xi_i$

L2 is as follows:

$\frac{1}{2}||w||^2 + \frac{C}{2}\sum_{i=1}^{M}\xi_i^2$

L2 imposes a stronger penalty on large errors, so it may be more effective, but it is also more sensitive to outliers. When epsilon is zero, L1 reduces to the absolute error and L2 to the squared error, so the choice still matters.

#### epsilon - float, default 0.0
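*epsilon* is the width of the margin used in the cost function above: prediction errors smaller than epsilon incur no penalty. With the default of 0.0, every error is penalized.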
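
The two per-sample loss terms can be sketched in plain Python. The function names mirror the scikit-learn option names, but the helpers themselves are written for this illustration only, and the residuals are made up:

```python
def epsilon_insensitive(residual, epsilon=0.0):
    """L1 loss: absolute error beyond the epsilon margin."""
    return max(0.0, abs(residual) - epsilon)

def squared_epsilon_insensitive(residual, epsilon=0.0):
    """L2 loss: squared error beyond the epsilon margin."""
    return max(0.0, abs(residual) - epsilon) ** 2

# A residual inside the margin costs nothing under either loss.
inside = (epsilon_insensitive(0.3, 0.5), squared_epsilon_insensitive(0.3, 0.5))

# A residual outside the margin is penalized, more heavily under L2.
outside = (epsilon_insensitive(2.0, 0.5), squared_epsilon_insensitive(2.0, 0.5))

# With epsilon = 0 the losses become absolute error and squared error.
no_margin = (epsilon_insensitive(2.0), squared_epsilon_insensitive(2.0))

print(inside, outside, no_margin)
```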


## Experiments with Hyperparameters

### tol

In this iteration of our project, we decrease the tolerance of the model by a factor of ten. My hypothesis is that this will have some effect, but it will also require increasing max_iter to achieve full effect. I have set **verbose** = 1, so that we can see what kind of output it gives.

In [10]:
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-5,
    C=1.0,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=1,
    random_state=None,
    max_iter=1000
)

svr_tol_e5_100000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 100000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 90000th record.
100000 samples retrieved.
[LibLinear]This model represents 63.28% of the sentiment in the text.


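The result (63.28%) is statistically similar to the benchmark (63.47%). Because train_test_split draws a different random split on every run, a difference this small cannot be attributed to the change in tol. Also, verbose=1 does not tell us much.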

I have set the tolerance another factor of ten smaller.

In [11]:
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-6,
    C=1.0,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=0,
    random_state=None,
    max_iter=1000
)

svr_tol_e6_100000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 100000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 90000th record.
100000 samples retrieved.
This model represents 63.32% of the sentiment in the text.


There was still no real change. Perhaps an increase in max_iter would help.

In [12]:
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-6,
    C=1.0,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=0,
    random_state=None,
    max_iter=10000
)

svr_tol_e6_iter_e4_100000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 100000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 90000th record.
100000 samples retrieved.
This model represents 63.33% of the sentiment in the text.


The improvements are minimal. In the next experiments, these will return to default.
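
Since these runs differ in only one or two arguments, the comparison can be written as a loop over a fixed split, so that any change in score comes from the hyperparameter rather than from a new random split. This is a sketch on synthetic data from `make_regression`, not the review data, and the `scores` dictionary and the seed values are assumptions for the example:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR

# Synthetic regression problem standing in for the vectorized reviews.
X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

# Fix the split once so every model sees the same train and test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

scores = {}
for tol in (1e-4, 1e-5, 1e-6):
    # random_state pins the solver's shuffling as well.
    model = LinearSVR(tol=tol, max_iter=10000, random_state=0)
    model.fit(X_train, y_train)
    scores[tol] = model.score(X_test, y_test)

for tol, score in scores.items():
    print(f'tol={tol:g}: R-squared = {score:.4f}')
```

The same loop works for C or max_iter by changing the variable being swept.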

### C

This parameter has the potential to quickly overfit our model and significantly increase processing time. I will proceed with caution.

In [13]:
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-4,
    C=2,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=0,
    random_state=None,
    max_iter=1000
)

svr_C2_100000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 100000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 90000th record.
100000 samples retrieved.
This model represents 61.28% of the sentiment in the text.


Increasing C actually decreased the accuracy of the model. I suspect overfitting. Perhaps the other direction?

In [14]:
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-4,
    C=0.5,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=0,
    random_state=None,
    max_iter=1000
)

svr_C_0_5_100000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 100000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 90000th record.
100000 samples retrieved.
This model represents 63.93% of the sentiment in the text.


Reducing C to 0.5 gave a small increase in the score, from 63.47% to 63.93%. Still, it is worth pursuing.

C=0.1

In [15]:
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-4,
    C=0.1,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=0,
    random_state=None,
    max_iter=1000
)

svr_C_0_1_100000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 100000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 90000th record.
100000 samples retrieved.
This model represents 57.82% of the sentiment in the text.


C of 0.1 decreased accuracy by a large amount. Split the difference.


In [18]:
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-4,
    C=0.3,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=0,
    random_state=None,
    max_iter=1000
)

svr_C_0_3_100000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 100000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 90000th record.
100000 samples retrieved.
This model represents 63.61% of the sentiment in the text.


This seems to be a little low still. Let's try 0.8.

In [9]:
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-4,
    C=0.8,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=0,
    random_state=None,
    max_iter=1000
)

svr_C_0_8_100000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 100000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 90000th record.
100000 samples retrieved.
This model represents 63.46% of the sentiment in the text.


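At 63.46%, C=0.8 is indistinguishable from the default's 63.47%, and the small gain at C=0.5 did not hold up at neighbouring values. It seems the default setting is best for this parameter as well.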

### dual and loss

This run switches to the primal problem (dual=False), which in LinearSVR requires the 'squared_epsilon_insensitive' loss.

In [16]:
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-4,
    C=1.0,
    loss='squared_epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=False,
    verbose=0,
    random_state=None,
    max_iter=1000
)

svr_dual_false_E2i_100000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 100000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 90000th record.
100000 samples retrieved.
This model represents 63.93% of the sentiment in the text.


In [19]:
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-4,
    C=0.8,
    loss='epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=True,
    verbose=0,
    random_state=None,
    max_iter=1000
)

svr_C_0_8_100000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 100000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')

Beginning at 0
Retrieved 90000th record.
100000 samples retrieved.
This model represents 63.34% of the sentiment in the text.


The dual=False run with the squared loss (63.93%) is marginally better than the benchmark (63.47%), but not statistically so. Still, it's moving in the right direction. And given the likely increase in sample size soon, it seems a viable option.

In [None]:
model = LinearSVR(
    epsilon=0.0,
    tol = 1e-4,
    C=1.0,
    loss='squared_epsilon_insensitive',
    fit_intercept=True,
    intercept_scaling=1.0,
    dual=False,
    verbose=1,
    random_state=None,
    max_iter=1000
)

svr_dual_false_E2i_1000000, score = training(
    model = model, 
    filein = '/home/srwight/Documents/Revature/Group Project/Yelp/nlp/english_only_reviews.json', 
    samplesize = 1000000
)

print(f'This model represents {round(score*100,2)}% of the sentiment in the text.')