### **Group 4**: Viktoria Konya, Peter Endes-Nagy, Khawaja Hassan, Shah Ali

### **Task 1**

Use the text similarity notebook and do the following: 1) lemmatize the tokens, 2) change min_df. What do your results look like?

In [1]:
documents = ["This little cat came to play when I was eating at a restaurant. I had to take a photo.",
             "Merley has the best squooshy kitten belly.",
             "Google Translate app is incredible.",
             "If you open 100 tabs in google chrome you get a smiley face.",
             "Best cat photo I've ever taken.",
             "Climbing ninja cat.",
             "Impressed with the current version of the google translate app.",
             "Key promoter extension for Google Chrome."]

In [2]:
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords as nltk_stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# English stopword list (requires the NLTK 'stopwords' corpus)
stopwords = nltk_stopwords.words('english')
lemmatizer = WordNetLemmatizer()

#### Text preprocessor - Lemmatization

In [3]:
# Lemmatizer

def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)

def text_preprocesser_lemmatize(text):
    
    # Replace non-word characters (punctuation, symbols) with spaces
    text = re.sub(r'\W',' ', text)
    
    # Replace everything that is not a letter (digits, underscores) with a space
    text = re.sub("[^a-zA-Z]+", " ", text)
    
    # Lowercase and tokenize
    tokens = [word.lower() for word in nltk.word_tokenize(text)]
    
    # Remove stopwords
    tokens = [token for token in tokens if token not in stopwords]
    
    # Remove words with length less than 3 characters
    tokens = [token for token in tokens if len(token)>=3]
    
    # Lemmatize each token using its WordNet POS tag
    lemma = [lemmatizer.lemmatize(token, get_wordnet_pos(token)) for token in tokens]
    
    # Join
    preprocessed_text = ' '.join(lemma)

    return preprocessed_text

#### Apply text preprocessor

In [4]:
# Check results
[text_preprocesser_lemmatize(document) for document in documents]

['little cat come play eat restaurant take photo',
 'merley best squooshy kitten belly',
 'google translate app incredible',
 'open tab google chrome get smiley face',
 'best cat photo ever take',
 'climb ninja cat',
 'impressed current version google translate app',
 'key promoter extension google chrome']

#### Apply TF-IDF vectorizer

We will:

* ignore terms that appear in fewer than 2 documents (`min_df=2`)
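
As a rough illustration of what this cutoff does, the sketch below counts in how many documents each term occurs and keeps only the terms that reach the threshold. It works on the preprocessed strings printed above and uses plain whitespace splitting. The helper name `vocabulary` is hypothetical, and this is a simplified stand-in for the filtering, not scikit-learn's implementation.

```python
from collections import Counter

# Preprocessed documents, copied from the output of the preprocessing step
preprocessed = [
    "little cat come play eat restaurant take photo",
    "merley best squooshy kitten belly",
    "google translate app incredible",
    "open tab google chrome get smiley face",
    "best cat photo ever take",
    "climb ninja cat",
    "impressed current version google translate app",
    "key promoter extension google chrome",
]

def vocabulary(docs, min_df):
    """Return the sorted terms that occur in at least `min_df` documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that a term is counted once per document
        doc_freq.update(set(doc.split()))
    return sorted(term for term, count in doc_freq.items() if count >= min_df)

vocab_min2 = vocabulary(preprocessed, min_df=2)
vocab_min3 = vocabulary(preprocessed, min_df=3)
print(vocab_min2)
print(vocab_min3)
```

Raising the threshold from 2 to 3 shrinks the vocabulary from eight terms to two, which shows how sensitive such a small corpus is to this parameter.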

In [5]:
tfidf_vectorizer = TfidfVectorizer(preprocessor=text_preprocesser_lemmatize, min_df = 2)
tfidf_vectorizer

TfidfVectorizer(min_df=2,
                preprocessor=<function text_preprocesser_lemmatize at 0x7f8d39e71670>)

In [6]:
tfidf = tfidf_vectorizer.fit_transform(documents)
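# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
df = pd.DataFrame(tfidf.toarray().transpose(), index=tfidf_vectorizer.get_feature_names_out())
df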

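| term | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| app | 0.0 | 0.0 | 0.623489 | 0.0 | 0.0 | 0.0 | 0.623489 | 0.0 |
| best | 0.0 | 1.0 | 0.0 | 0.0 | 0.516768 | 0.0 | 0.0 | 0.0 |
| cat | 0.520868 | 0.0 | 0.0 | 0.0 | 0.445928 | 1.0 | 0.0 | 0.0 |
| chrome | 0.0 | 0.0 | 0.0 | 0.797471 | 0.0 | 0.0 | 0.0 | 0.797471 |
| google | 0.0 | 0.0 | 0.471725 | 0.603358 | 0.0 | 0.0 | 0.471725 | 0.603358 |
| photo | 0.603613 | 0.0 | 0.0 | 0.0 | 0.516768 | 0.0 | 0.0 | 0.0 |
| take | 0.603613 | 0.0 | 0.0 | 0.0 | 0.516768 | 0.0 | 0.0 | 0.0 |
| translate | 0.0 | 0.0 | 0.623489 | 0.0 | 0.0 | 0.0 | 0.623489 | 0.0 |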


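After lemmatization, 'taken' (document 4) was merged with 'take' (document 0). The token now appears in two documents, so it passes the min_df=2 cutoff instead of being dropped as two separate single-document terms.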
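
The effect on document frequency can be sketched for documents 0 and 4. The token lists below were worked out by hand (stopwords and short tokens removed, no lemmatization yet), and the small `LEMMAS` lookup is a hypothetical stand-in for `WordNetLemmatizer` that covers only the words that change in these two documents.

```python
from collections import Counter

# Tokens of documents 0 and 4 before lemmatization (derived by hand)
raw_tokens = [
    ["little", "cat", "came", "play", "eating", "restaurant", "take", "photo"],
    ["best", "cat", "photo", "ever", "taken"],
]

# Hypothetical lookup standing in for WordNetLemmatizer on these tokens
LEMMAS = {"came": "come", "eating": "eat", "taken": "take"}

def doc_freq(token_lists):
    """Count in how many documents each token occurs."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))
    return counts

before = doc_freq(raw_tokens)
after = doc_freq([[LEMMAS.get(t, t) for t in tokens] for tokens in raw_tokens])

print(before["take"], before["taken"])  # each form occurs in one document
print(after["take"], after["taken"])    # the merged lemma occurs in two
```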

Let's try the TF-IDF vectorizer with a higher minimum document frequency: a term must now appear in at least 3 documents (min_df=3).

In [7]:
tfidf_vectorizer = TfidfVectorizer(preprocessor=text_preprocesser_lemmatize, min_df=3)
tfidf = tfidf_vectorizer.fit_transform(documents)
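# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() exists since 1.0
df = pd.DataFrame(tfidf.toarray().transpose(), index=tfidf_vectorizer.get_feature_names_out())
df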

| term | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|---|
| cat | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| google | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 |


The dataset is too small for meaningful experimentation. 'cat' and 'google' are the only terms that appear in at least 3 documents: 'cat' in 3 (documents 0, 4 and 5) and 'google' in 4 (documents 2, 3, 6 and 7). Since no document contains both of these "top" words, every document vector has at most one non-zero entry, which the L2 normalization scales to exactly 1, so we practically have binary dummy variables. With larger datasets (longer texts) this is quite an unlikely result, but nevertheless an interesting insight.
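
The 0/1 values in the min_df=3 table can be checked by hand. The sketch below assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`, raw term counts) and recomputes the weights of document 0 for both thresholds. The helper names `smooth_idf` and `l2_normalise` are hypothetical, and the document frequencies are read off the tables above.

```python
import math

N_DOCS = 8  # number of documents in the corpus

def smooth_idf(doc_freq, n_docs=N_DOCS):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def l2_normalise(weights):
    """Scale a {term: weight} vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

# Document frequencies of the vocabulary terms that occur in document 0
doc_freq = {"cat": 3, "photo": 2, "take": 2}

# min_df=2: 'cat', 'photo' and 'take' are all kept (each occurs once in document 0)
doc0_min2 = l2_normalise({t: smooth_idf(df) for t, df in doc_freq.items()})

# min_df=3: only 'cat' survives, so the vector has a single non-zero entry
doc0_min3 = l2_normalise({"cat": smooth_idf(doc_freq["cat"])})

print(doc0_min2)  # cat is about 0.5209, photo and take about 0.6036
print(doc0_min3)  # cat becomes 1.0, whatever its IDF value is
```

With a single surviving term the IDF cancels out in the normalization, which is why every non-zero cell in the min_df=3 table equals 1.0.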