# 1. Introduction

In this task, we vectorize the review language, specifically a field formed by concatenating each review's summary and its text, under 4 different configurations:

- TFIDF
- TFIDF + N-grams
- TFIDF + N-grams + POS tagging
- TFIDF + N-grams + POS tagging + Other features

These configurations follow the instructions in the [task description document](https://github.com/schmidt-marvin/ESI_2022_TecAA/tree/main/task03/provided_files/ML2022_Milestone_3_Task_Definition.pdf). After the vectorization, feature reduction is applied to remove 70% of the features and keep only the most relevant ones.

# 2. Preparations

## 2.1. Importing libraries

In [None]:
# Misc
import re
import copy
import pandas as pd
from google.colab import output

# Natural language
import nltk
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download("popular")

# Feature reduction
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2, f_regression

# Data processing
from sklearn import preprocessing

output.clear()

## 2.2. Importing dataset

In [None]:
# Import from CSV
!wget https://raw.githubusercontent.com/schmidt-marvin/ESI_2022_TecAA/main/task03/intermediate_files/products_preprocessed.csv
output.clear()

df_products_preprocessed = pd.read_csv("products_preprocessed.csv", sep=",", index_col="Id")
df_products_preprocessed.head()

Id,Summary,Text,Score
1,good quality dog good,i have bought several of the vitality canned d...,5
2,not a advertised,product arrived labelled a lumbo halted peanut...,1
3,delight say it all,his is a connection that ha been around a few ...,4
4,rough medicine,of you are looking for the secret ingredient i...,2
5,great staff,great staff at a great price. there wa a wide...,5


## 2.3. Utility functions

This subsection defines a function that is later used to delete all numbers from the reviews, as well as a constant that limits the number of analyzed reviews.

In [None]:
NUM_REVIEWS = 1000

def remove_numbers(x):
  return re.sub(r'[0-9]+', '', str(x))
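
As a quick check of the helper above, the following sketch applies `remove_numbers` to a few hand-made inputs (the sample values are illustrative and not taken from the dataset). Two details are worth noting: removing digits leaves the surrounding spaces in place, and because of the `str(x)` conversion a missing value (`NaN`) becomes the literal string `nan` instead of staying empty.

```python
import re

def remove_numbers(x):
    # Same helper as above: cast to str, then delete every run of digits
    return re.sub(r'[0-9]+', '', str(x))

# Illustrative inputs (assumptions, not rows from the dataset)
cleaned_text = remove_numbers("bought 2 packs for 10 dollars")
cleaned_number = remove_numbers(2022)
cleaned_missing = remove_numbers(float("nan"))

print(repr(cleaned_text))     # digits removed, spaces around them kept
print(repr(cleaned_number))   # a purely numeric value becomes an empty string
print(repr(cleaned_missing))  # a missing value becomes the string 'nan'
```

If the literal `nan` is not wanted in the vocabulary, missing values could be filled with an empty string before applying the function.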

# 3. Dataset cleaning

First, the given dataset needs a few small changes to improve efficiency and the readability of the results. The first one is to remove all numbers from the reviews, since they add little meaning and make up a large part of the vocabulary.

In [None]:
df_products_preprocessed['Summary'] = df_products_preprocessed['Summary'].apply(remove_numbers)
df_products_preprocessed['Text'] = df_products_preprocessed['Text'].apply(remove_numbers)

Second, some review scores are not values in the range from 1 to 5, as seen below.

In [None]:
df_wrong_score = df_products_preprocessed[~df_products_preprocessed['Score'].isin(['1','2','3','4','5'])]
df_wrong_score['Score']

Id
282        0
523       47
896        0
1333       8
1396       0
        ... 
49905      6
50134     RN
50469      0
50698      0
50763     10
Name: Score, Length: 230, dtype: object
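
The check above compares against the string labels `'1'` to `'5'` because the column is read with dtype `object`. An alternative sketch, which is not the approach used in this notebook, is to coerce the column to numbers with `pd.to_numeric(..., errors='coerce')` and keep the values between 1 and 5. The toy series below is an assumption chosen to resemble the invalid values shown above.

```python
import pandas as pd

# Toy scores (assumption): a mix of valid labels, out-of-range numbers and text
scores = pd.Series(['5', '0', '47', 'RN', '3', '10'], name='Score')

# Non-numeric entries such as 'RN' become NaN instead of raising an error
numeric_scores = pd.to_numeric(scores, errors='coerce')

# NaN compares as False, so invalid text is rejected together with 0, 47 and 10
valid_mask = numeric_scores.between(1, 5)
valid_scores = numeric_scores[valid_mask].astype(int)

print(valid_scores.tolist())
```

Unlike the `isin` check, this variant would also accept a non-integer value such as `4.5`, so an extra integrality check may be needed depending on the data.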

Since these values are invalid, we can remove the affected rows without further problems. We also add the `Review` field, which consists of the concatenation of the `Summary` and `Text` fields.

In [None]:
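# Build Review as "Summary Text" with a vectorized concatenation instead of a row-by-row loop
df_products_preprocessed['Review'] = df_products_preprocessed['Summary'].astype(str) + ' ' + df_products_preprocessed['Text'].astype(str)
# Keep only rows with a valid score label; copy so later in-place changes act on an independent dataframe
df_products_preprocessed = df_products_preprocessed[df_products_preprocessed['Score'].isin(['1','2','3','4','5'])].copy()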

df_products_preprocessed.reset_index(drop = True, inplace = True)
df_products_preprocessed.head()

Unnamed: 0,Summary,Text,Score,Review
0,good quality dog good,i have bought several of the vitality canned d...,5,good quality dog good i have bought several of...
1,not a advertised,product arrived labelled a lumbo halted peanut...,1,not a advertised product arrived labelled a lu...
2,delight say it all,his is a connection that ha been around a few ...,4,delight say it all his is a connection that ha...
3,rough medicine,of you are looking for the secret ingredient i...,2,rough medicine of you are looking for the secr...
4,great staff,great staff at a great price. there wa a wide...,5,great staff great staff at a great price. the...


We can export this cleaned dataframe in case it is needed later on.

In [None]:
df_products_preprocessed.to_csv('products_preprocessed_review.csv')

# 4. Vectorization

In this section, the reviews are vectorized according to each of the given configurations.

## 4.1. TFIDF

Since vectorizing the whole dataset could exceed the RAM available in Google Colab, we decided to use a subset of the first 1000 reviews, which is enough to perform a natural language analysis within our size and time limitations. Thanks to the previously imported `TfidfVectorizer`, this task can be done in a few lines, as seen below:

In [None]:
# Obtain first 1000 reviews and scores
reviews_tfidf = df_products_preprocessed['Review'][:NUM_REVIEWS].tolist()
scores_tfidf = df_products_preprocessed['Score'][:NUM_REVIEWS].tolist()
scores_int_tfidf = np.array([int(x) for x in scores_tfidf])

# Vectorize with TFIDF
tfidf_vectorizer = TfidfVectorizer(use_idf=True, smooth_idf=True)
vectors_tfidf = tfidf_vectorizer.fit_transform(reviews_tfidf)

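# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0)
df_vectors_tfidf = pd.DataFrame(vectors_tfidf.toarray(), columns = tfidf_vectorizer.get_feature_names_out())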
df_vectors_tfidf



Unnamed: 0,abby,abdominal,able,about,above,absence,absense,absolute,absolutely,absorb,...,zero,zest,zevia,zinc,zing,zip,zippy,zola,zucchini,îtis
0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
996,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
997,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
998,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
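
To make the weights in the table above easier to interpret, the sketch below reproduces the smoothed IDF that scikit-learn's `TfidfVectorizer` uses with `smooth_idf=True`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, on a three-document toy corpus. The documents are made up for illustration. It also checks that every row is L2-normalized, which is the vectorizer's default behaviour.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption, not taken from the review dataset)
docs = ["good dog food", "bad dog food", "good price"]

vec = TfidfVectorizer(use_idf=True, smooth_idf=True)
X = vec.fit_transform(docs)

# Smoothed IDF computed by hand for the term "good"
n_docs = len(docs)
df_good = sum("good" in d.split() for d in docs)  # number of documents containing "good"
manual_idf_good = np.log((1 + n_docs) / (1 + df_good)) + 1

# The same value as stored by the fitted vectorizer
sklearn_idf_good = vec.idf_[vec.vocabulary_["good"]]

# Each row of the TF-IDF matrix has unit Euclidean length (norm='l2' by default)
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()

print(manual_idf_good, sklearn_idf_good)
print(row_norms)
```

Because of this row normalization, a value in the table depends on the other terms in the same review, not only on the term itself.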


## 4.2. TFIDF + N-grams

In this case, we use the `TfidfVectorizer` with the `ngram_range` parameter, which sets the range of values of n for the n-grams. We use n = 1 and n = 2, since also including n = 3 leads to much longer execution times, more columns than the POS tagging step can handle, and huge CSV files.

In [None]:
# Vectorize with TFIDF using unigrams and bigrams (n = 1 and n = 2)
tfidf_ngrams_vectorizer = TfidfVectorizer(ngram_range=(1,2))
vectors_tfidf_ngrams = tfidf_ngrams_vectorizer.fit_transform(reviews_tfidf)

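# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0)
df_vectors_tfidf_ngrams = pd.DataFrame(vectors_tfidf_ngrams.toarray(), columns = tfidf_ngrams_vectorizer.get_feature_names_out())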
df_vectors_tfidf_ngrams



Unnamed: 0,abby,abdominal,abdominal cramping,able,able to,about,about an,about and,about any,about anything,...,zola,zola or,zucchini,zucchini and,zucchini asparagus,zucchini brown,zucchini had,zucchini organic,îtis,îtis real
0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
996,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
997,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
998,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
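
The growth in the number of columns when bigrams are added can be seen on a very small scale with the sketch below, which fits one vectorizer with `ngram_range=(1, 1)` and one with `ngram_range=(1, 2)` on a made-up three-document corpus and compares the vocabulary sizes. The corpus and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (assumption, not taken from the review dataset)
docs = ["good dog food", "bad dog food", "good price"]

# Unigrams only versus unigrams plus bigrams
unigram_vectorizer = TfidfVectorizer(ngram_range=(1, 1)).fit(docs)
bigram_vectorizer = TfidfVectorizer(ngram_range=(1, 2)).fit(docs)

n_unigram_features = len(unigram_vectorizer.vocabulary_)
n_bigram_features = len(bigram_vectorizer.vocabulary_)

# Entries containing a space are the bigrams added by the wider range
bigrams_only = sorted(t for t in bigram_vectorizer.vocabulary_ if " " in t)

print(n_unigram_features, n_bigram_features)
print(bigrams_only)
```

On real text the effect is much stronger, because most word pairs occur only once or twice. In this notebook the reduced outputs have 1718 columns for plain TFIDF versus 12907 columns once bigrams are included.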


## 4.3. TFIDF + N-grams + POS Tagging

In this case, we decided to use POS tagging to count the number of adjectives present in each review. We believe this information is relevant since adjectives carry strong meaning, especially when the review is either very good or very bad. We tag every single word (1-gram) found in the 1000 reviews to find out which of them are adjectives, tagged as `JJ`.

In [None]:
tags = {}

# For every analyzed column, keep only single words (1-grams)
for column in df_vectors_tfidf_ngrams.columns:
  if len(column.split()) == 1:
    # Tag the word on its own and store the tag of its first token
    tags[column] = nltk.pos_tag(word_tokenize(column))[0][1]

print(tags)



Now that every word is tagged, we can go through each review and count how many of its words are tagged `JJ`.

In [None]:
df_vectors_tfidf_ngrams_postag = df_vectors_tfidf_ngrams.copy()

num_adjectives = []

for i in range(NUM_REVIEWS):
  words = reviews_tfidf[i].split()
  adj_counter = 0
  for word in words:
    if word in tags.keys():
      if tags[word] == 'JJ':
        adj_counter += 1
  num_adjectives.append(adj_counter)

df_vectors_tfidf_ngrams_postag['num_adjectives'] = num_adjectives
df_vectors_tfidf_ngrams_postag

Unnamed: 0,abby,abdominal,abdominal cramping,able,able to,about,about an,about and,about any,about anything,...,zola or,zucchini,zucchini and,zucchini asparagus,zucchini brown,zucchini had,zucchini organic,îtis,îtis real,num_adjectives
0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
1,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
2,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
3,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
996,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
997,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
998,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
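
One detail of the counting loop above is that `str.split()` keeps punctuation attached to words, so a token such as `price.` or `tasty.` does not match any key in `tags`. The sketch below contrasts this with tokenizing by the regular expression `(?u)\b\w\w+\b`, which is the default `token_pattern` of `TfidfVectorizer`. The `tags` dictionary here is hand-written for illustration (a real run would use the NLTK output), and both function names are hypothetical.

```python
import re

# Hand-written tags (assumption); in the notebook these come from nltk.pos_tag
tags = {"great": "JJ", "tasty": "JJ", "staff": "NN", "price": "NN"}

def count_adjectives_split(text, tags):
    # Whitespace split, as in the loop above: punctuation stays glued to the word
    return sum(tags.get(word) == "JJ" for word in text.split())

def count_adjectives_regex(text, tags):
    # Same token pattern as TfidfVectorizer's default, so tokens match the vocabulary
    tokens = re.findall(r"(?u)\b\w\w+\b", text.lower())
    return sum(tags.get(token) == "JJ" for token in tokens)

review = "great staff at a great price. very tasty."
n_split = count_adjectives_split(review, tags)
n_regex = count_adjectives_regex(review, tags)

print(n_split, n_regex)
```

The whitespace version misses `tasty.` because of the trailing period, so the counts exported in this notebook may be slightly lower than the number of adjectives actually present.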


## 4.4. TFIDF + N-grams + POS Tagging + Other features

Finally, in the last configuration, we add the number of words and the number of sentences of each review, since longer reviews could tend to be worse reviews, and vice versa.

In [None]:
df_vectors_tfidf_ngrams_postag_other = df_vectors_tfidf_ngrams_postag.copy()

num_words = []
num_sentences = []

for i in range(len(reviews_tfidf)):
  num_words.append(len(reviews_tfidf[i].split()))
  num_sentences.append(len(reviews_tfidf[i].split('. ')))

df_vectors_tfidf_ngrams_postag_other['num_words'] = num_words
df_vectors_tfidf_ngrams_postag_other['num_sentences'] = num_sentences
df_vectors_tfidf_ngrams_postag_other

Unnamed: 0,abby,abdominal,abdominal cramping,able,able to,about,about an,about and,about any,about anything,...,zucchini and,zucchini asparagus,zucchini brown,zucchini had,zucchini organic,îtis,îtis real,num_adjectives,num_words,num_sentences
0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,52,3
1,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,34,2
2,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,98,8
3,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1,43,3
4,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,29,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5,57,3
996,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5,88,5
997,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4,27,1
998,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2,52,4
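
The sentence count above relies on `split('. ')`, which only recognizes a period followed by a space. As a hedged alternative, and not the method used to produce the exported files, the sketch below also treats `!` and `?` as sentence terminators. The function names and the sample text are assumptions for illustration.

```python
import re

def count_sentences_split(text):
    # Same rule as in the notebook: a sentence ends at a period followed by a space
    return len(text.split('. '))

def count_sentences_regex(text):
    # Split on runs of '.', '!' or '?' and ignore empty fragments
    parts = re.split(r'[.!?]+', text)
    return len([part for part in parts if part.strip()])

sample = "great price. love it! would buy again."
n_split = count_sentences_split(sample)
n_regex = count_sentences_regex(sample)

print(n_split, n_regex)
```

Since numbers were already removed from the reviews, decimal points are not a concern here, although abbreviations would still be counted as sentence ends by both variants.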


# 5. Feature reduction

Now we move on to the feature reduction, which in all cases is done with `SelectKBest`, using the chi-square score function and reducing the number of features by 70% (that is, keeping 30% of them).

## 5.1. TFIDF

First, we can see the feature reduction in the TFIDF configuration:

In [None]:
reduced_df_vectors_tfidf = SelectKBest(score_func = chi2, k = int(len(df_vectors_tfidf.columns) * 0.3)).fit_transform(df_vectors_tfidf, scores_int_tfidf)
reduced_df_vectors_tfidf

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Then we put this array into a dataframe and add the score column, so it can be used by the models implemented later.

In [None]:
df_tfidf_export = pd.DataFrame(reduced_df_vectors_tfidf)
df_tfidf_export['score'] = scores_int_tfidf
df_tfidf_export

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1709,1710,1711,1712,1713,1714,1715,1716,1717,score
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4


Finally, we export it to a csv file.

In [None]:
df_tfidf_export.to_csv('tfidf.csv')
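
The exported dataframe above labels its columns 0 to 1717, so the link between a column and its term is lost. If the term names are needed later, one possible approach is to fit the selector first and use `get_support()` to recover the kept columns. The sketch below uses random non-negative toy data and made-up column names (`term_0` to `term_9`), not the review vectors.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative features and scores from 1 to 5 (assumptions for illustration)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((20, 10)), columns=[f"term_{i}" for i in range(10)])
y = rng.integers(1, 6, size=20)

# Keep 30% of the features, as in the notebook
k = int(X.shape[1] * 0.3)
selector = SelectKBest(score_func=chi2, k=k).fit(X, y)

# Boolean mask of the selected columns, used to keep their names
kept = X.columns[selector.get_support()]
reduced = pd.DataFrame(selector.transform(X), columns=kept, index=X.index)

print(list(kept))
```

Keeping the names makes it possible to inspect which terms the chi-square test considers most related to the score.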

## 5.2. TFIDF + N-grams

In this case, we first make sure that no auxiliary columns (score, number of words, sentences, or adjectives) are present, and then apply the feature reduction.

In [None]:
# Drop the auxiliary columns if present, so that only TFIDF features are reduced
aux_columns = ["score", "num_words", "num_sentences", "num_adjectives"]
df_vectors_tfidf_ngrams_fr = df_vectors_tfidf_ngrams.drop(columns = aux_columns, errors = 'ignore')

reduced_df_vectors_tfidf_ngrams = SelectKBest(score_func = chi2, k = int(len(df_vectors_tfidf_ngrams_fr.columns) * 0.3)).fit_transform(df_vectors_tfidf_ngrams_fr, scores_int_tfidf)
reduced_df_vectors_tfidf_ngrams

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

Then we add the array to a dataframe:

In [None]:
df_tfidf_ngrams_export = pd.DataFrame(reduced_df_vectors_tfidf_ngrams)
df_tfidf_ngrams_export['score'] = scores_int_tfidf
df_tfidf_ngrams_export

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12898,12899,12900,12901,12902,12903,12904,12905,12906,score
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4


And export it with the scores column:

In [None]:
df_tfidf_ngrams_export.to_csv('tfidf_ngrams.csv')

## 5.3. TFIDF + N-grams + POS Tagging

The process here is very similar to the previous one.

In [None]:
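# Drop the auxiliary columns into a new dataframe, which is the one used for the reduction below
df_vectors_tfidf_ngrams_postag_fr = df_vectors_tfidf_ngrams_postag.drop(columns = ['Score', 'num_words', 'num_sentences', 'num_adjectives'], errors = 'ignore')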

reduced_df_vectors_tfidf_ngrams_postag = SelectKBest(score_func = chi2, k = int(len(df_vectors_tfidf_ngrams_postag_fr.columns) * 0.3)).fit_transform(df_vectors_tfidf_ngrams_postag_fr, scores_int_tfidf)
reduced_df_vectors_tfidf_ngrams_postag

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

For better model performance, we normalize the `num_adjectives` column so that all its values lie within the range from 0 to 1.

In [None]:
min_max_scaler = preprocessing.MinMaxScaler()

df_tfidf_ngrams_postag_export = pd.DataFrame(reduced_df_vectors_tfidf_ngrams_postag)
df_tfidf_ngrams_postag_export['num_adjectives'] = num_adjectives
df_tfidf_ngrams_postag_export['score'] = scores_int_tfidf

df_tfidf_ngrams_postag_export['num_adjectives'] = min_max_scaler.fit_transform(df_tfidf_ngrams_postag_export[['num_adjectives']])

df_tfidf_ngrams_postag_export

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12899,12900,12901,12902,12903,12904,12905,12906,num_adjectives,score
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,5
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,4
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,2
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.10,1
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.10,2
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,5
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,4


And export the dataframe to a csv file.

In [None]:
df_tfidf_ngrams_postag_export.to_csv('tfidf_ngrams_postag.csv')
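
`MinMaxScaler` maps each value x to (x - min) / (max - min), computed per column. The toy counts in the sketch below are assumptions chosen to be consistent with the output above, where 4 adjectives map to 0.08 and 5 adjectives map to 0.10, which implies a minimum of 0 and a maximum of 50 in that column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy adjective counts (assumption): minimum 0 and maximum 50
counts = np.array([[0], [1], [2], [4], [5], [50]])

# Scaling with scikit-learn
scaled = MinMaxScaler().fit_transform(counts).ravel()

# The same scaling written out by hand
manual = ((counts - counts.min()) / (counts.max() - counts.min())).ravel()

print(scaled)
```

A single review with many adjectives therefore compresses all other values towards 0, which is why most rows in the table show values such as 0.04 or 0.08.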

## 5.4. TFIDF + N-grams + POS Tagging + Other features

As before, we apply the feature reduction with the same parameters:

In [None]:
df_vectors_tfidf_ngrams_postag_other.drop(columns = ['Score', 'num_words', 'num_sentences', 'num_adjectives'], inplace = True, errors = 'ignore')

reduced_df_vectors_tfidf_ngrams_postag_other = SelectKBest(score_func = chi2, k = int(len(df_vectors_tfidf_ngrams_postag_other.columns) * 0.3)).fit_transform(df_vectors_tfidf_ngrams_postag_other, scores_int_tfidf)
reduced_df_vectors_tfidf_ngrams_postag_other

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

We add the 4 additional columns needed in this case:

In [None]:
df_tfidf_ngrams_postag_other_export = pd.DataFrame(reduced_df_vectors_tfidf_ngrams_postag_other)
df_tfidf_ngrams_postag_other_export['num_words'] = num_words
df_tfidf_ngrams_postag_other_export['num_sentences'] = num_sentences
df_tfidf_ngrams_postag_other_export['num_adjectives'] = num_adjectives
df_tfidf_ngrams_postag_other_export['score'] = scores_int_tfidf
df_tfidf_ngrams_postag_other_export

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12901,12902,12903,12904,12905,12906,num_words,num_sentences,num_adjectives,score
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,52,3,4,5
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,34,2,2,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,98,8,4,4
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,43,3,1,2
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,29,4,4,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,57,3,5,1
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,88,5,5,2
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,27,1,4,5
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,52,4,2,4


Since we normalized the values of the `num_adjectives` column in the previous configuration, we do the same here, and also normalize the `num_sentences` and `num_words` columns.

In [None]:
df_tfidf_ngrams_postag_other_export['num_adjectives'] = min_max_scaler.fit_transform(df_tfidf_ngrams_postag_other_export[['num_adjectives']])
df_tfidf_ngrams_postag_other_export['num_sentences'] = min_max_scaler.fit_transform(df_tfidf_ngrams_postag_other_export[['num_sentences']])
df_tfidf_ngrams_postag_other_export['num_words'] = min_max_scaler.fit_transform(df_tfidf_ngrams_postag_other_export[['num_words']])
df_tfidf_ngrams_postag_other_export

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12901,12902,12903,12904,12905,12906,num_words,num_sentences,num_adjectives,score
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.044084,0.074074,0.08,5
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.023202,0.037037,0.04,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.097448,0.259259,0.08,4
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.033643,0.074074,0.02,2
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.017401,0.111111,0.08,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.049884,0.074074,0.10,1
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.085847,0.148148,0.10,2
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.015081,0.000000,0.08,5
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.044084,0.111111,0.04,4


And finally export the processed dataframe.

In [None]:
df_tfidf_ngrams_postag_other_export.to_csv('tfidf_ngrams_postag_other.csv')

# 6. Usage in upcoming colab files

As in the previous colab file, the results are already exported and uploaded, so upcoming colab files can download them directly and keep the mentioned "out-of-the-box" functionality.

For TFIDF configuration:

In [None]:
!wget https://raw.githubusercontent.com/schmidt-marvin/ESI_2022_TecAA/main/task03/intermediate_files/tfidf_2.csv
output.clear()

tfidf_csv = pd.read_csv("tfidf_2.csv", sep=",", index_col = 0)
tfidf_csv.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1709,1710,1711,1712,1713,1714,1715,1716,1717,score
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5


For TFIDF + N-grams configuration:

In [None]:
!wget https://raw.githubusercontent.com/schmidt-marvin/ESI_2022_TecAA/main/task03/intermediate_files/tfidf_ngrams_2.csv
output.clear()

tfidf_ngrams_csv = pd.read_csv("tfidf_ngrams_2.csv", sep=",", index_col = 0)
tfidf_ngrams_csv.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12898,12899,12900,12901,12902,12903,12904,12905,12906,score
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5


For TFIDF + N-grams + POS Tagging configuration:

In [None]:
!wget 'https://raw.githubusercontent.com/schmidt-marvin/ESI_2022_TecAA/main/task03/intermediate_files/tfidf_ngrams_postag.csv'
output.clear()

tfidf_ngrams_postag_csv = pd.read_csv("tfidf_ngrams_postag.csv", sep=",", index_col = 0)
tfidf_ngrams_postag_csv.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12899,12900,12901,12902,12903,12904,12905,12906,num_adjectives,score
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,5
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.04,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,4
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.02,2
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.08,5


For TFIDF + N-grams + POS Tagging + Other features configuration:

In [None]:
!wget 'https://raw.githubusercontent.com/schmidt-marvin/ESI_2022_TecAA/main/task03/intermediate_files/tfidf_ngrams_postag_other.csv'
output.clear()

tfidf_ngrams_postag_others_csv = pd.read_csv("tfidf_ngrams_postag_other.csv", sep=",", index_col = 0)
tfidf_ngrams_postag_others_csv.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,12901,12902,12903,12904,12905,12906,num_words,num_sentences,num_adjectives,score
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.044084,0.074074,0.08,5
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.023202,0.037037,0.04,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.097448,0.259259,0.08,4
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.033643,0.074074,0.02,2
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.017401,0.111111,0.08,5
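
In an upcoming colab file, each of these dataframes would typically be split into a feature matrix and a target vector before training a model. The sketch below shows one way to do this on a tiny in-memory CSV with the same layout (numbered feature columns followed by `score`). The CSV content and the variable names are made up for illustration.

```python
import io
import pandas as pd

# Tiny stand-in for one of the exported files (assumption, same column layout)
csv_text = ",0,1,2,score\n0,0.0,0.5,0.0,5\n1,0.3,0.0,0.0,1\n2,0.0,0.0,0.7,4\n"

# Same read_csv arguments as above, reading from memory instead of a file
df = pd.read_csv(io.StringIO(csv_text), sep=",", index_col=0)

# Features are every column except the target
X = df.drop(columns=["score"])
y = df["score"]

print(X.shape, y.tolist())
```

Note that after the CSV round trip the numbered column labels are read back as strings (`'0'`, `'1'`, ...) rather than integers.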
