# Delivery: Linear Models (Part 1) - by Sindre Øyen

In part 1 of this delivery, I explore how machine learning can be used to detect sentiment in texts, that is, to decide whether a text is negatively or positively loaded. The models are trained on the aclImdb dataset.

---
### Exercise 1

In exercise 1, the training and test sets of the dataset are loaded.

*Initial setups and helper methods*

In [1]:
from sklearn.datasets import load_files
import os

In [2]:
def printInfo(reviews_data, text_data, y_data):
    '''
    Prints information about the input data.

    Parameters:
    reviews_data (sklearn.utils.Bunch): A dictionary-like object that contains the reviews dataset.
    text_data (list of bytes): The raw review texts, one bytes object per document.
    y_data (numpy.ndarray): An array of integers containing the labels for the text data.

    Returns:
    None
    '''
    print("Type of text_data: {}".format(type(text_data)))
    print("Type of y_data: {}".format(type(y_data)))
    print("Number of texts in the data set: {}".format(len(text_data)))
    print("Labels: {}".format(reviews_data.target_names))

    print("text_data[6]:\n{}\n".format(text_data[6]))
    print("y_data[6]: {}\n".format(y_data[6]))
    print("Associated label: {}".format(reviews_data.target_names[y_data[6]]))

    print("text_data[25]:\n{}\n".format(text_data[25]))
    print("y_data[25]: {}\n".format(y_data[25]))
    print("Associated label: {}".format(reviews_data.target_names[y_data[25]]))

*Creating a general method for loading folders from the IMDB dataset*

In [3]:
# Loading the data
def loadSet(datafolderName: str, shouldPrint: bool = False) -> tuple:
    '''
    Loads the data from the specified folder of the IMDB set.

    Parameters:
    datafolderName (str): The name of the folder that contains the data.
    shouldPrint (bool): A boolean value that indicates if the information about the data should be printed.
    
    Returns:
    tuple: A tuple containing the text data and the labels.
    '''
    datafolder = os.path.join("aclImdb", datafolderName)
    reviews_data = load_files(datafolder)
    text_data, y_data = reviews_data.data, reviews_data.target
    text_data = [doc.replace(b"<br />", b" ") for doc in text_data]
    if shouldPrint:
        printInfo(reviews_data, text_data, y_data)
    return text_data, y_data


*Loading the training and test data, and counting the instances in each one*

In [4]:
# Loading the training set
text_train, y_train = loadSet("train")
# Loading the test set
text_test, y_test = loadSet("test")

print(f"Number of instances in the training set: {len(text_train)}")
print(f"Number of instances in the test set: {len(text_test)}")

Number of instances in the training set: 48840
Number of instances in the test set: 25000
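
`load_files` assigns one label per subfolder, so any extra subfolder next to `neg` and `pos` becomes a class of its own (the standard aclImdb archive ships a `train/unsup` folder, for instance). The three probability columns printed in Exercise 5 hint that something like this may have happened here. The following is a minimal sketch, on a throwaway folder tree invented for this illustration, of how the `categories` argument restricts loading to the sentiment folders:

```python
import os
import tempfile

from sklearn.datasets import load_files

# Build a tiny, made-up folder tree that mimics the layout load_files expects:
# one subfolder per class, one text file per document.
root = tempfile.mkdtemp()
for label, text in [("neg", "awful film"), ("pos", "great film"), ("unsup", "some film")]:
    os.makedirs(os.path.join(root, label))
    with open(os.path.join(root, label, "0.txt"), "w", encoding="utf-8") as fh:
        fh.write(text)

# Without `categories`, every subfolder becomes a label.
all_folders = load_files(root)
# With `categories`, only the listed subfolders are loaded.
sentiment_only = load_files(root, categories=["neg", "pos"])

print(all_folders.target_names)
print(sentiment_only.target_names)
print(len(all_folders.data), len(sentiment_only.data))
```

Whether this applies to the data used in this delivery depends on the contents of the local `aclImdb/train` folder, which this section does not show.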


### Exercise 2
*Utilizing ***CountVectorizer*** on an example data set of four sentences*

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

The ***CountVectorizer*** takes a collection of documents, here an array of sentences. It builds a vocabulary from the unique words in the collection and returns a matrix of dimensions *NxW*, where *N* is the length of the input array and *W* is the number of unique words. Each row records how many times each vocabulary word occurs in the corresponding sentence, giving a numeric (bag-of-words) representation of the sentences that can be used in ML models.

As you can see by running the code below, this CountVectorizer instance returns a *4x18* matrix for the example data: 4 row vectors of length 18, each holding the counts of the vocabulary words in one sentence.

In [6]:
cuatro_frases =["Cargamento de oro dañado por el fuego",
              "La entrega de la plata llegó en el camión color plata",
              "El cargamento de oro llegó en un camión",
              "Oro, oro, oro: gritó al ver el camión"]

example_vectorizer = CountVectorizer()
example_vectorizer.fit(cuatro_frases)
example_vector = example_vectorizer.transform(cuatro_frases)
# Print data to analyze the results
print(f"CountVectorizer vocabulary: {example_vectorizer.get_feature_names_out()}")
print(f"\nCountVectorizer matrix:\n{example_vector.toarray()}\n")
print(f"CountVectorizer dimensions: {example_vector.shape}")

CountVectorizer vocabulary: ['al' 'camión' 'cargamento' 'color' 'dañado' 'de' 'el' 'en' 'entrega'
 'fuego' 'gritó' 'la' 'llegó' 'oro' 'plata' 'por' 'un' 'ver']

CountVectorizer matrix:
[[0 0 1 0 1 1 1 0 0 1 0 0 0 1 0 1 0 0]
 [0 1 0 1 0 1 1 1 1 0 0 2 1 0 2 0 0 0]
 [0 1 1 0 0 1 1 1 0 0 0 0 1 1 0 0 1 0]
 [1 1 0 0 0 0 1 0 0 0 1 0 0 3 0 0 0 1]]

CountVectorizer dimensions: (4, 18)


---
### Exercise 3
Using CountVectorizer to vectorize the IMDB training data

To help the model generalize, I am going to use the *min_df* and *stop_words* parameters of the CountVectorizer initializer. First, I will explain how each of these parameters works.

##### min_df
*min_df* sets the minimum document frequency a word needs in order to be included in the vectorizer's vocabulary. It accepts either an integer (an absolute number of documents) or a float between 0 and 1 (a proportion of the documents). For example, *min_df*=0.01 means that a word has to appear in at least 1% of all training documents (reviews, in this case) to be included. Raising this parameter removes rare words, which can help reduce overfitting to terms that occur too seldom to generalize.

##### stop_words
*stop_words* is a parameter that specifies a set of words to be left out of the vocabulary. The reason for using it is that the counts of some words (e.g., filler words such as "in", "at", "this") do not necessarily carry meaning in the data, and treating them as informative might lead to unwanted results. The scikit-learn library has a built-in ENGLISH_STOP_WORDS set, which is selected by passing *stop_words*='english'. Using the *stop_words* parameter leads to fewer features in the vocabulary, which can make it easier to obtain a model that generalizes.

In [61]:
vectorizer = CountVectorizer(min_df=0.015, stop_words='english')
vectorizer.fit(text_train)
X_train = vectorizer.transform(text_train)
X_test = vectorizer.transform(text_test)
print(f"X_train.shape: {X_train.shape}")
print(f"X_test.shape: {X_test.shape}")

X_train.shape: (48840, 1068)
X_test.shape: (25000, 1068)


### Exercise 4
Training classifiers on the vectorized data, specifically ***LogisticRegression*** and ***LinearSVC***.

In training these models, I will be using the *C* regularization parameter. This parameter is the inverse of the regularization strength, meaning that a lower *C* value leads to stronger regularization and a higher *C* value leads to weaker regularization. This effectively means that:
- A low C value can lead to high bias but low variance (or underfitting)
- A high C value can lead to low bias but high variance (or overfitting)
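
The effect of *C* on the learned weights can be sketched on synthetic data. The data set below comes from `make_classification` and is not related to the IMDB reviews; it only illustrates that smaller *C* values shrink the weight vector more strongly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data, used here only to illustrate the effect of C.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

norms = {}
for C in (0.01, 1.0, 100.0):
    model = LogisticRegression(C=C, max_iter=10000).fit(X, y)
    # Size of the weight vector: stronger regularization gives smaller weights.
    norms[C] = np.linalg.norm(model.coef_)
    print(f"C={C:>6}: ||w|| = {norms[C]:.3f}, train accuracy = {model.score(X, y):.3f}")
```

Smaller weights make the model less sensitive to individual features, which is why a very low *C* tends toward underfitting and a very high *C* toward overfitting.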

In [62]:
# Training the Logistic Regression model
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(C=0.65, random_state=12, max_iter=10000)
logreg.fit(X_train, y_train)


In [63]:
# Training the Linear Support Vector Machine model
from sklearn.svm import LinearSVC

svm = LinearSVC(C=0.65, random_state=12)
svm.fit(X_train, y_train)



In [60]:
# Evaluating the models
print(f"Logistic Regression score: {logreg.score(X_test, y_test)}")
print(f"Linear Support Vector Machine score: {svm.score(X_test, y_test)}")

Logistic Regression score: 0.18896
Linear Support Vector Machine score: 0.16488


### Exercise 5
Run prediction functions such as `predict_proba` and `decision_function` and explain how the values generated by each are calculated. 

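- `predict_proba`: returns a vector of predicted probabilities, one per class. For LogisticRegression in the multinomial setting, these are obtained by applying the softmax function to the linear scores *w·x + b* of each class (in the binary case this reduces to the logistic sigmoid).

- `decision_function`: returns a confidence score per class, computed as the linear score *w·x + b* for that class's hyperplane. In the binary case a score > 0 means the positive class is predicted; with more than two classes, the class with the highest score is predicted, even if all scores are negative.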

In [43]:
pred_logreg = logreg.predict_proba(X_test[0])
pred_svm = svm.decision_function(X_test[0])

# Printing the classifications
print(f"Logistic Regression predictions:\n{pred_logreg}")
print(f"Linear Support Vector Machine predictions:\n{pred_svm}")

Logistic Regression predictions:
[[0.22862668 0.27536801 0.49600531]]
Linear Support Vector Machine predictions:
[[-0.47640029 -0.43062229 -0.04093294]]
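
The relationship between the two functions and the fitted weights can be checked on synthetic data. This is a sketch assuming a recent scikit-learn version in which multiclass LogisticRegression uses the multinomial (softmax) formulation by default; the names `demo_logreg`, `demo_svm` and `softmax` are made up for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic three-class data, used only for illustration.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

demo_logreg = LogisticRegression(max_iter=10000).fit(X, y)
demo_svm = LinearSVC(max_iter=10000).fit(X, y)


def softmax(scores):
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


# Logistic regression: probabilities are the softmax of the linear scores.
logreg_scores = X @ demo_logreg.coef_.T + demo_logreg.intercept_
proba_matches = np.allclose(softmax(logreg_scores), demo_logreg.predict_proba(X))

# LinearSVC: decision_function returns the linear scores themselves,
# and predict picks the class with the highest score.
svm_scores = X @ demo_svm.coef_.T + demo_svm.intercept_
scores_match = np.allclose(svm_scores, demo_svm.decision_function(X))
argmax_matches = np.array_equal(
    demo_svm.classes_[demo_svm.decision_function(X).argmax(axis=1)],
    demo_svm.predict(X),
)

print(proba_matches, scores_match, argmax_matches)
```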
