# Exercise
Please complete the following **two questions**, E1 and E2. **You should submit an ".ipynb" file to Canvas.** (When you have completed your answers here, you can download the notebook using "File" > "Download .ipynb".)

## E1. What are two limitations of using one-hot encoding?

In [21]:
#Lab01 - E1

Answer = """There are two significant limitations of one-hot encoding:
    - It produces high-dimensional, sparse vectors (one dimension per vocabulary word), which are memory-intensive and make many common machine learning methods, including those used in natural language processing, slow or inaccurate.
    - For NLP in particular, every pair of distinct words is equally far apart (their vectors are orthogonal), so the encoding has no way to represent word similarity.
""" #@param {type:"raw"}
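
Both limitations can be seen on a toy example. The sketch below is illustrative only: the three-word vocabulary and the helper names `one_hot` and `dot` are assumptions made for this example, not part of the lab material.

```python
# Hypothetical toy vocabulary; a real vocabulary has tens of thousands of words.
vocab = ["cat", "dog", "car"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of `word` over the toy vocabulary."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

cat, dog, car = one_hot("cat"), one_hot("dog"), one_hot("car")

# Sparsity: each vector has exactly one non-zero entry out of len(vocab).
print(cat, dog, car)

# No similarity: "cat" is as unrelated to "dog" as it is to "car".
print(dot(cat, dog), dot(cat, car))
```

Each vector grows by one dimension for every word added to the vocabulary, and the dot product between any two different words is always zero, regardless of meaning.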

## E2. Calculate TF-IDF and search the Wiki page.

In this exercise, we will practise calculating TF-IDF on documents retrieved with the [wikipedia library](https://pypi.org/project/wikipedia/), a Python library that makes it easy to access and parse data from Wikipedia. Using the calculated TF-IDF values, we then search Wikipedia for the page of the word that has the top-1 TF-IDF value.
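
Before running on real Wikipedia pages, it may help to see the calculation on a tiny corpus. The following is a minimal sketch, assuming the same smoothed IDF that this notebook's function uses, `log(N / (df + 1) + 1)`; the toy documents and the helper name `tfidf` are made up for illustration.

```python
from collections import Counter
from math import log

# Hypothetical pre-tokenized toy corpus (three short "documents").
docs = [
    ["harbour", "bridge", "harbour"],
    ["tram", "bridge"],
    ["river", "bridge"],
]

N = len(docs)
# Document frequency: the number of documents that contain each word.
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Map each word in `doc` to raw term count times smoothed IDF."""
    tf = Counter(doc)
    return {word: tf[word] * log(N / (df[word] + 1) + 1) for word in tf}

scores = [tfidf(doc) for doc in docs]

# The word with the top-1 TF-IDF value in the first document.
top_word = max(scores[0], key=scores[0].get)
print(top_word, round(scores[0][top_word], 2))
```

"bridge" appears in every document, so its IDF is low; "harbour" is frequent in one document and absent from the others, so it receives the highest score there.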

In [22]:
# Install the wikipedia package: try conda first, then fall back to pip.
try:
    %conda install wikipedia
except Exception:
    %pip install wikipedia

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.


In [30]:
## Import packages:
import re
from collections import defaultdict
from math import log

import numpy as np
import wikipedia
from nltk import word_tokenize
from nltk.corpus import stopwords as sw

# Prepare data
cities = ["Sydney","Melbourne","Brisbane"]
corpus = [wikipedia.page(city).content for city in cities]

stopwords = sw.words('english')
stopwords.extend([city.lower() for city in cities])

## TF-IDF-oriented function:
def tfidf_wikisearch(corpus):
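  # A single document passed as a string is wrapped in a list.
  # (Strings are iterable, so checking for '__iter__' would not catch this case.)
  if isinstance(corpus, str):
    corpus = [corpus]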

  tf_idf = {}

  # tokenize each document: strip punctuation, lowercase, and drop stopwords
  t_corpus = list()
  for doc in corpus:
    t_corpus.append([
      word for word 
      in word_tokenize(re.sub(r'[^\w\s]','',doc).lower()) 
      if word not in stopwords
    ])

  # calculate document frequency
  DF = defaultdict(int)
  for doc in t_corpus:
    for word in np.unique(doc):
      DF[word] += 1

  # calculate the TF-IDF value of each unique word in each document
  N = len(corpus)
  doc_id = 0
  for doc in t_corpus:
    for word in DF:
      tf = doc.count(word)
      idf = log(N/(DF[word]+1)+1)
      tf_idf[doc_id, word] = tf*idf

    doc_id += 1

  # sort the words by TF-IDF value, highest first, to get the word with the top-1 TF-IDF value

  sorted_tf_idf = [
    (doc, word, freq) 
    for ((doc, word), freq)
    in sorted(tf_idf.items(), key=lambda x: x[1], reverse = True)
  ]

  # search the wiki page for this word and print out the page content 

  doc, most_common_word, score = sorted_tf_idf[0]
  page = wikipedia.page(most_common_word)  # fetch the page once and reuse it
  print("Total documents in corpus:", len(corpus))
  print("The word with top-1 TF-IDF value is:", most_common_word, f"({score:.2f})")
  print("The retrieved wiki page content:")
  print()
  print("===", page.title, "===")
  print(page.content[:5000])

  return tf_idf

# Call the function; the printed execution log should be kept for submission
result = tfidf_wikisearch(corpus)


Total documents in corpus: 3
The word with top-1 TF-IDF value is: queensland (67.93)
The retrieved wiki page content:

=== Queensland ===
Queensland (locally , KWEENZ-land) is a state situated in northeastern Australia, and is the second-largest and third-most populous Australian state. It is bordered by the Northern Territory, South Australia, and New South Wales to the west, south-west and south respectively. To the east, Queensland is bordered by the Coral Sea and the Pacific Ocean. To its north is the Torres Strait, separating the Australian mainland from Papua New Guinea. With an area of 1,852,642 square kilometres (715,309 sq mi), Queensland is the world's sixth-largest sub-national entity, and is larger than all but 15 countries. Due to its size, Queensland's geographical features and climates are diverse, including tropical rainforests, rivers, coral reefs, mountain ranges and sandy beaches in its tropical and sub-tropical coastal regions, as well as deserts and savanna in the semi-