# Final Project: Machine Learning Applications
## Bachelor in Data Science and Engineering

Done by:

* Alvaro Viejo Alonso (NIA: 100451677)
* Rodrigo Oliver Coimbra (NIA: 100451788)
* Héctor Tienda Cárdenas (NIA: 100)

## Introduction and explanation

## Text Processing with SpaCy

In [1]:
# Import the libraries needed in this section.

# Before importing, make sure you have run the following commands:
# !conda install -c conda-forge spacy
# !conda install -c conda-forge cupy
# !python -m spacy download en_core_web_trf

# Import SpaCy
import spacy
import csv
import pandas as pd

# Load the en_core_web_trf model
nlp = spacy.load("en_core_web_trf")



In [26]:
def preprocess_text(text):
    """
    Tokenize and lemmatize the text, removing stopwords, punctuation
    and whitespace tokens.
    Return the remaining lemmas joined into a single string.
    """
    doc = nlp(text)
    # No line continuations are needed inside the brackets
    tokens = [
        token.lemma_.lower().strip()
        for token in doc
        if not token.is_stop
        and not token.is_punct
        and token.pos_ != "SPACE"
    ]
    return " ".join(tokens)

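def ner(text):
    # ent.label is spaCy's integer ID for the entity label;
    # ent.label_ would give the readable string (e.g. "PERSON").
    doc = nlp(text)
    return [ent.label for ent in doc.ents]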

def remove_freq_words(doc, freq_threshold, rare_threshold):
    # Counts are keyed by the hash of each token's lowercase form
    word_freq = doc.count_by(spacy.attrs.LOWER)
    freq_words = {doc.vocab[w].text for w in word_freq if word_freq[w] > freq_threshold}
    rare_words = {doc.vocab[w].text for w in word_freq if word_freq[w] < rare_threshold}
    excluded = freq_words | rare_words
    # Compare lowercase forms so that e.g. "The" matches the count for "the"
    tokens = [token.text for token in doc if token.lower_ not in excluded]
    return " ".join(tokens)

sample_df["cleaned_text"] = sample_df["review_text"].apply(preprocess_text)
sample_df["ner"] = sample_df["review_text"].apply(ner)
sample_df["vector"] = sample_df["cleaned_text"].apply(vectorize_text)
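
The thresholding idea behind `remove_freq_words` can be shown without a spaCy pipeline. The following is a minimal sketch that uses only the standard library. The function `filter_by_frequency` and the sample sentence are hypothetical and not part of the notebook, and the sketch assumes that tokens are counted and compared by their lowercase form.

```python
from collections import Counter


def filter_by_frequency(tokens, freq_threshold, rare_threshold):
    """Keep tokens whose lowercase form occurs at least `rare_threshold`
    and at most `freq_threshold` times (hypothetical helper)."""
    counts = Counter(token.lower() for token in tokens)
    return [
        token
        for token in tokens
        if rare_threshold <= counts[token.lower()] <= freq_threshold
    ]


# Hypothetical sample: "the" and "was" occur 3 times, "good" twice,
# every other word once.
tokens = "The book was good and the plot was good but the ending was weak".split()
kept = filter_by_frequency(tokens, freq_threshold=2, rare_threshold=2)
print(" ".join(kept))
```

With both thresholds set to 2, only words that occur exactly twice survive. In practice the thresholds would be tuned on the whole corpus rather than on a single review.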


In [30]:
sample_df

| Unnamed: 0.1 | Unnamed: 0 | review_text | rating | book_genre | doc | cleaned_text | ner | vector |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | I originally gave this three stars, but it was... | 4 | children | originally gave stars , close decided miserly ... | originally give star close decide miserly bump... | [397, 397, 397, 388, 388, 397, 388, 397, 388, ... | [] |
| 1 | 1 | they didnt actually quit, they just wrote lett... | 4 | children | nt actually quit , wrote letters complaint tel... | nt actually quit write letter complaint tell k... | [] | [] |
| 2 | 2 | This story follows a family consisting of a fa... | 5 | children | story follows family consisting father , mothe... | story follow family consist father mother boy ... | [391, 397, 391, 396, 388, 387, 397, 388, 396, ... | [] |
| 3 | 3 | I don't remember reading this book in school, ... | 4 | children | remember reading book school , decided try . l... | remember read book school decide try lois lowr... | [380] | [] |
| 4 | 4 | Read for the 2016 YA/MG Book Battle. This book... | 5 | children | read 2016 ya / mg book battle . book simply ch... | read 2016 ya mg book battle book simply charmi... | [391, 383, 380, 380, 380, 397, 396, 387, 380, ... | [] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 443 | 443 | Seriously dunno what to think.. This was my re... | 3 | young_adult | seriously dunno think .. reaction novel . conc... | seriously dunno think reaction novel concept b... | [397, 397, 380, 9191306739292312949] | [] |
| 444 | 444 | Possible trigger warnings: abuse\*, teen pregna... | 5 | young_adult | possible trigger warnings : abuse \* , teen pre... | possible trigger warning abuse teen pregnancy ... | [380, 397, 396, 388, 380, 380, 380, 393, 380, ... | [] |
| 445 | 445 | I really loved this! "To all the boys I've lov... | 5 | young_adult | loved ! " boys loved " unique diverse novel wo... | love boy love unique diverse novel wonderful a... | [388, 380, 380, 380, 380, 380] | [] |
| 446 | 446 | I wanted to give this book three stars, and I ... | 4 | young_adult | wanted book stars , thought half / quarters bo... | want book star think half quarter book john gr... | [397, 396, 397, 380, 397, 380, 380, 397] | [] |


In [19]:
def homogenize(text):
    doc = nlp(text)
    tokens = [token.text.lower() for token in doc if not token.is_stop]
    return " ".join(tokens)

sample_df["doc"] = sample_df["review_text"].apply(homogenize)

In [3]:
sample_df = pd.read_csv("reviews_spoiler_reduced.csv", encoding="utf-8")

# nlp.pipe processes the reviews in batches, which is usually faster
# than calling nlp() once per review
doc = list(nlp.pipe(sample_df.review_text))
# sample_df["doc"] = list(nlp.pipe(sample_df.review_text))

In [15]:
type(nlp("Hello"))

spacy.tokens.doc.Doc

In [16]:
spacy.displacy.serve(sample_df["doc"].iloc[0], style="dep")




Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [17]:
spacy.displacy.serve(sample_df["doc"].iloc[0], style="ent")


Using the 'ent' visualizer
Serving on http://0.0.0.0:5000 ...



### Tokenization

### Homogenization

### Cleaning

### Vectorization