**02452** *Machine Learning*, Technical University of Denmark

- This Jupyter notebook contains exercises where you fill in missing code related to the lecture topic. *First*, try solving each task yourself. *Then* use the provided solution (an HTML file you can open in any web browser) as inspiration if needed. If you get stuck, ask a TA for help.

- Some tasks may be difficult or time-consuming - using the solution file or TA support is expected and perfectly fine, as long as you stay active and reflect on the solution.

- You are not expected to finish everything during the session. Prepare by looking at the exercises *before* the class, consult the TAs *during* class, and complete the remaining parts *at home*.

---

# Week 1: Introduction, data and visualization


**Content:**

- Part 1: Loading data in Python - the Iris flower data set
- Part 2: Cleaning up data in Python
- Part 3: Text-representation in Python


**Objectives:**
- Be able to import data into Python and represent it in the course format of $\boldsymbol{X}, \boldsymbol{y}$.
- Be able to do common preprocessing steps for datasets.
- Understand the bag of words representation for text documents including filtering methods based on removal of stop words and stemming.
- Understand some of the many ways data can be visualized including histograms, boxplots, and scatter plots.

In [None]:
import numpy as np
import pandas as pd
from scipy.io import loadmat

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer

# Plotting style
sns.set_style('darkgrid')
sns.set_theme(font_scale=1.)

## Introduction

In this exercise we will take a closer look at ways to load data in Python and explore some of the many data visualization techniques we can apply to better understand the content of a dataset.

We will use a standard data representation throughout this course, by collecting $N$ individual data points $\boldsymbol{x}_i = \left[x_1, x_2, \dots x_M\right]^\top$ with $M$ attributes/features in a data matrix $\boldsymbol{X}$ of size $N\times M$. This means that the $i$'th row in $\boldsymbol{X}$ corresponds to the $i$'th data point, while the $j$'th column contains all observations of the $j$'th attribute. As discussed in the lecture, these attributes can be discrete or continuous and of the types nominal, ordinal, interval or ratio. 

Later in the course we will use the data matrix $\boldsymbol{X}$ for solving both 1) **predictive tasks** where we aim to predict unknown values of some target attribute and 2) **descriptive tasks** where we aim at finding human-interpretable patterns that describe the data.

- Using $\boldsymbol{X}$ for predictive tasks is what we call **supervised learning** and consists of learning a mapping between the attributes in $\boldsymbol{X}$ and a **target attribute**. We will here call $\boldsymbol{X}$ the **input data** and denote the target attribute by $\boldsymbol{y}=\left[y_1, y_2, \dots y_N\right]^\top$ such that there is a target value $y_i$ associated with the $i$'th input data point. We can think of supervised learning as finding a function, $f$, such that our predictions $\hat{\boldsymbol{y}} = f\left(\boldsymbol{X}\right)$ are "close" to the true values in $\boldsymbol{y}$. If the target attribute is discrete we solve a **classification task**, while we solve a **regression task** if the target attribute is continuous.

- Using $\boldsymbol{X}$ for descriptive tasks is known as **unsupervised learning** and is used for exploratory analyses of the data. Examples of unsupervised learning techniques are 1) **clustering**, which relates to finding hidden group structures in $\boldsymbol{X}$, 2) **anomaly detection**, which considers finding abnormal data points, and 3) **representation learning**, where we seek some meaningful latent representations of the data.
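
To make the notation concrete, here is a minimal sketch with made-up numbers. The values and the integer label coding are assumptions for illustration only and are not taken from any course dataset.

```python
import numpy as np

# Hypothetical toy data: N = 3 data points with M = 2 attributes each
X_toy = np.array([[5.1, 3.5],
                  [4.9, 3.0],
                  [6.2, 2.9]])

# One target value per data point (here: discrete class labels coded as integers)
y_toy = np.array([0, 0, 1])

N_toy, M_toy = X_toy.shape
print(f"N = {N_toy}, M = {M_toy}")

# Row i holds the i'th data point, column j holds all observations of attribute j
first_point = X_toy[0, :]
second_attribute = X_toy[:, 1]
print(first_point, second_attribute)
```

Since `y_toy` is discrete, this toy setup would correspond to a classification task; replacing it with continuous values would turn it into a regression task.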

In the following exercise, we will get familiar with loading datasets distributed as various types of files into the $\left(\boldsymbol{X}, \boldsymbol{y}\right)$-format described above. We will generally load data into Pandas dataframes, as the Pandas library provides flexible tools for data analysis, except when working with more complex data types such as text and images.

<br>

---

## Part 1: Loading data in Python - the Iris flower dataset

The Iris flower dataset (downloaded [here](http://archive.ics.uci.edu/ml/datasets/Iris)), also known as Fisher's Iris dataset, is a multivariate dataset introduced by Sir Ronald Aylmer Fisher (1936) for the problem of classifying Iris flower types. It is sometimes called Anderson's Iris dataset because Edgar Anderson collected the data to quantify the geographic variation of Iris flowers in the Gaspé Peninsula. The dataset consists of 50 samples from each of $C=3$ species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor), i.e. $N=150$ observations in total. Each observation has $M=4$ measured attributes: the length and the width of the sepal and petal, in centimetres, hence $\boldsymbol{x}_i = (x_1,x_2,x_3,x_4)$. Based on the combination of the four variables, Fisher developed a model to distinguish the species from each other, and the dataset is now used as a typical test case for many other classification techniques (see [here](http://en.wikipedia.org/wiki/Iris_flower_data_set)).

A simple format for storing data is the comma-separated values file format (or CSV). In such files, a sample or an observation is a line in a text document, and the document then has as many lines (or rows) as there are samples, i.e. $N$. The attribute values for an observation are written within one line, separated by (usually) a comma or a tab-character in a consistent order. This order is usually defined in a header (the first line of the file), which gives the name of each variable in some format.
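
The CSV layout described above can be sketched on a tiny made-up file. The column names (`length`, `width`, `species`) and values are hypothetical and need not match those in `iris.csv`, and `io.StringIO` is used only so that the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content: a header line followed by one line per observation
csv_text = """length,width,species
1.4,0.2,setosa
4.7,1.4,versicolor
6.0,2.5,virginica
"""

# pd.read_csv accepts a file path or a file-like object; StringIO stands in for a file
df_toy = pd.read_csv(io.StringIO(csv_text))

# Split into inputs and target, keeping both as Pandas types
X_toy = df_toy.drop(columns=["species"])
y_toy = pd.Categorical(df_toy["species"])

print(X_toy.shape)        # (N, M)
print(y_toy.categories)   # the distinct class names
print(y_toy.codes)        # one integer code per observation
```

A `pd.Categorical` keeps the readable class names and an integer code for each observation side by side, which is convenient when a model needs numerical targets.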

**Task 1.1:** Inspect the `iris.csv` file from the associated data folder. Load the CSV-file using Pandas and split it into the standard $\left(\boldsymbol{X}, \boldsymbol{y}\right)$-format. Make sure to keep $\boldsymbol{X}$ and $\boldsymbol{y}$ as Pandas data types.
> *Hint:* Open the CSV-file using e.g. Notepad for Windows or TextEdit for MacOS.

> *Hint:* Use `df = pd.read_csv()` to load the CSV-file into a Pandas dataframe. Then split it into $\left(\boldsymbol{X}, \boldsymbol{y}\right)$, e.g. using `df.drop()` and `df[target_attr_name]`.

> *Hint:* The class labels (the flower species) are stored as text (or strings). Convert them into a categorical attribute using `pd.Categorical()`. Type `help(pd.Categorical)` if you get lost.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Check the shape of the data
N, M = X.shape
assert N == 150, "There should be 150 samples in the Iris dataset."
assert M == 4, "There should be 4 features in the Iris dataset."

# Display the first few rows of the dataframe
X.head()

Sometimes datasets are distributed as Excel-files `.xls(x)`. 

**Task 1.2:** Load the same Iris data, when it has been stored as an Excel-file using Pandas. Split it into the standard $\left(\boldsymbol{X}, \boldsymbol{y}\right)$-format as before.
> *Hint:* Open `iris.xls` in the data folder to have a look at the file.

> *Hint:* Use `df = pd.read_excel()` to load an Excel-file into a Pandas dataframe. For further information, type `help(pd.read_excel)`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Check the shape of the data
N, M = X.shape
assert N == 150, "There should be 150 samples in the Iris dataset."
assert M == 4, "There should be 4 features in the Iris dataset."

# Display the first few rows of the dataframe
X.head()

Other times data is stored as MATLAB files (`.mat`). 

**Task 1.3:** Load the same Iris data, when it has been stored in the MATLAB-file `iris.mat`. 
> *Hint:* Use `data = loadmat()` (imported from `scipy.io` at the top of the notebook) to load the data file. Check the data structure, e.g. using `data.keys()`, to see what information it contains.

**Task 1.4:** Split it into the standard $\left(\boldsymbol{X}, \boldsymbol{y}\right)$-format and convert $\boldsymbol{X}$ and $\boldsymbol{y}$ to Pandas datatypes.
> *Hint:* In MATLAB-files, strings are stored in Numpy arrays. To extract the string information, use a list comprehension of the form: `[val.item() for val in data['key_in_the_dictionary'].flatten()]`

> *Hint:* Use `pd.DataFrame()` to construct a Pandas dataframe from a `np.ndarray`. Type `help(pd.DataFrame)` to figure out how to name the columns.

> *Hint:* You should construct $\boldsymbol{y}$ as a `pd.Categorical` with the class names as the values. The information is available in `data`.
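
The way strings are wrapped in MATLAB-files can be seen in a small round trip. This is only a sketch: the keys `X` and `attributeNames` and all values are hypothetical and need not match the contents of `iris.mat`, and the file is written to memory with `savemat` purely to keep the example self-contained.

```python
import io
import numpy as np
import pandas as pd
from scipy.io import loadmat, savemat

# Write a small hypothetical MATLAB-file to memory
buffer = io.BytesIO()
savemat(buffer, {
    "X": np.array([[1.4, 0.2], [4.7, 1.4]]),
    "attributeNames": np.array(["length", "width"], dtype=object),
})
buffer.seek(0)

data_toy = loadmat(buffer)
print(data_toy.keys())  # also contains metadata keys such as '__header__'

# Each string comes back wrapped in a small Numpy array; .item() unwraps it
names = [val.item() for val in data_toy["attributeNames"].flatten()]
X_toy = pd.DataFrame(data_toy["X"], columns=names)
print(X_toy)
```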

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Check the shape of the data
N, M = X.shape
assert N == 150, "There should be 150 samples in the Iris dataset."
assert M == 4, "There should be 4 features in the Iris dataset."

# Display the first few rows of the dataframe
X.head()

For some modeling tasks, working with Pandas dataframes is cumbersome. Luckily, we can easily convert Pandas datatypes into numerical arrays of type `np.ndarray` if we stored the data correctly. In the following cell, we do so for the Iris dataset:

In [None]:
# Convert X and y to numpy arrays
X_numpy, y_numpy = X.values, y.codes

# Print the first 5 samples of X and y
print(f"X: {X_numpy[:5, :]}")
print(f"\ny: {y_numpy[:5]}")

In the examples up until now, we have handled the data in the Iris dataset as if to solve a classification problem. We could say that the **primary machine learning modelling aim** is to classify the species of Iris flower based on the petal and sepal dimensions. However, we could also use the dataset to illustrate how to do regression without needing to use a whole different dataset. We would achieve this by e.g. trying to predict either of the petal (or sepal) dimensions based on the remaining dimensions, for instance. This changes how we define our $\left(\boldsymbol{X}, \boldsymbol{y}\right)$-format.

**Task 1.5:** Cast the Iris dataset into a regression problem. To do so, set up the $\left(\boldsymbol{X}, \boldsymbol{y}\right)$-format in Pandas such that we are predicting the petal lengths from the other continuous attributes.
> *Hint:* Use `df.drop()` to remove the target attribute and non-continuous variable.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Check the shape of the regression data
N_reg, M_reg = X_regression.shape
assert N_reg == 150, "There should be 150 samples in the Iris dataset."
assert M_reg == 3, "There should be 3 features in the Iris regression dataset."

# Display the first few rows of the regression dataframe
X_regression.head()

#### Basic plotting in Python

In the following we will do an initial data analysis of the Iris dataset through basic plots of the attributes. We will recreate the plots in section 7.1 of the course book.

**Task 1.8:** Plot histograms of the four attributes using `plt.subplots()`. Argue from the graph that the petal length is either between 1 and 2 cm. or between 3 and 7 cm. but that no flowers in the dataset have a petal length between 2 and 3 cm. Do you think this could be useful to discriminate between the different types of flowers?
> *Hint:* Check the documentation using `help(plt.subplots)`.

> *Hint:* Use `plt.hist()` to plot a histogram.

> *Hint:* Use indexing to extract each attribute. For example, `df.iloc[:, j-1]` extracts the $j$'th attribute.
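
The subplot mechanics can be tried out on synthetic data first. The sketch below uses randomly generated values and made-up attribute names, so it says nothing about the Iris data itself.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data: 100 observations of 2 hypothetical attributes
rng = np.random.default_rng(0)
values = rng.normal(loc=[5.0, 3.0], scale=[0.8, 0.4], size=(100, 2))
names = ["attribute 1", "attribute 2"]

# One subplot per attribute, arranged in a single row
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for j, ax in enumerate(axes):
    # ax.hist returns the bin counts, the bin edges and the drawn patches
    counts, bin_edges, _ = ax.hist(values[:, j], bins=15)
    ax.set_title(names[j])
    ax.set_xlabel("value")
    ax.set_ylabel("count")
fig.tight_layout()
```

In a notebook with inline plotting the figure is displayed when the cell finishes; in a plain script you would add `plt.show()` at the end.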

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Task 1.9:** Produce a boxplot of the four attributes in the Iris data. This boxplot shows the same information as the histogram in the previous exercise. Discuss the advantages and disadvantages of the two types of plots.
> *Hint:* Use the function `plt.boxplot()` for creating the figure.

- *Answer:* 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Task 1.10:** Create a figure using `plt.subplots()` that contains boxplots for each attribute for each class as in Figure 7.2 in the course book. Show on the graph that all the Iris-setosa in this dataset have a petal length between 1 and 2 cm. Do you think we would be able to distinguish between the Iris types from the measured sepal and petal length? Why/why not?

> *Hint:* You can split the Pandas dataframe into subsets based on a specific attribute using `df.groupby()`.

> *Hint:* Make sure to specify `sharey=True` when creating the subplot structure. This allows us to compare the values more easily.

- *Answer:* 

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Task 1.11:** Create a matrix scatter plot using `plt.subplots()` of each combination of two attributes against each other as shown in Figure 7.6 in the course book. Say you want to discriminate between the three types of flowers using only the length and width of either sepal or petal. Argue from the graph why it would be better to use petal length and width rather than sepal length and width.
> *Hint:* To make a scatter plot, you can use the function `plt.scatter()` and fill the plots with two for-loops. Remember to keep track of the axes!

> *Hint:* Another way to extract the data of a specific class is to write `df.query('Type == "Iris-setosa"')`.

> *Hint:* For the diagonal plots, we could also choose more informative plots than the scatter plot. Try filling it with the histograms.

- *Answer:*


In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Task 1.12:** Produce a 3-dimensional scatter plot of three attributes as shown in Figure 7.7 in the course book. Try rotating the data. Can you find an angle where the three types of flower are separated in the plot? Discuss the pros and cons of visualizing data in 2 and 3 dimensions, respectively. How would you plot data that is inherently 4-dimensional or higher?
> *Hint:*  You can add a 3D subplot element by `ax = fig.add_subplot(111, projection='3d')`. Read more about plotting in 3 dimensions [here](matplotlib.sourceforge.net/mpl_toolkits/mplot3d/tutorial.html).

> *Hint:* In Jupyter notebooks, you can open the plots in a separate window (for interaction) by writing the magic command `%matplotlib qt` at the top of the cell. In contrast, what you've done so far is inline plotting with `%matplotlib inline`.

- *Answer:* 

In [None]:
%matplotlib qt 
# This opens an interactive window for 3D plotting. 
# If you have problems use `%matplotlib inline` instead.

# YOUR CODE HERE
raise NotImplementedError()

**Task 1.13:** Apply standardization to the data matrix $\boldsymbol{X}$ that we constructed in the first part of the exercise so that it has zero mean and unit standard deviation. Plot the standardized data matrix as an image. What does this plot indicate?
> *Hint:* You can use the function `plt.imshow()` to plot an image. Check out the documentation of this function.

> *Hint for interpreting the plot:* the data matrix is ordered according to the sorted class labels!

In [None]:
%matplotlib inline 
# go back to inline plotting

# YOUR CODE HERE
raise NotImplementedError()

A question that we need to ask ourselves is whether the dataset is suitable for solving our primary machine learning aim, which we here define as being classification. 

**Task 1.14:** Do you think that we would be able to fit a classification model on the Iris dataset? Argue based on the figures you generated. What if we change the primary machine learning aim to be the regression task mentioned earlier?

> *Hint:* Consider which attributes we defined as *target* and *input* attributes, respectively, for the two modeling tasks.

- *Answer:* 

You are welcome to try out other plotting methods for the data. The Matplotlib online gallery is a good source of inspiration: https://matplotlib.org/stable/gallery/index.html

<br>

---

## Part 2: Cleaning up data in Python

While the Iris dataset is a real dataset, it is very clean and easy to work with. Usually, data is a bit messier, and we will consider a toy dataset that has some common issues. Often, the description of "real-world" data is stored along with the data in some form of a text file. Have a look at the folder `messy_data` in the data folder and read more about the toy dataset - notice that there is a `README.txt` file.

**Task 2.1:** Inspect the data in `messy_data.data` and try to identify some issues (use e.g. a simple text editor as before). What issues did you find?

- *Answer:* 

**Task 2.2:** Load the messy dataset using `pd.read_csv()`.
> *Hint:* Even though `messy_data.data` is not a `.csv`-file, we can load it by specifying the argument `sep="\t"` as the values are tab-separated.

> *Hint:* What index is the header in the file? Specify this with the `header` argument.

> *Hint:* To remove the header from the values of the dataframe, use `messy_data.drop()`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Check the shape of the messy data
N, M = messy_data.shape
assert N == 29, "There should be 29 samples in the messy dataset."
assert M == 9, "There should be 9 features in the messy dataset."

# Display the messy dataframe
messy_data

At this point, you'll see that some of the missing values in the data have already been represented as NaNs (in the displacement column). However, these are only the places where the file contained an empty element. We also see that the weight attribute uses ' as the thousand separator, which Python cannot parse as part of a number. For the acceleration attribute we even see inconsistency in whether commas or dots are used as the decimal separator - in Python, we use dots.

**Task 2.3:** Replace the question marks in displacement with not a number, i.e. `NaN`, and solve the problems with the separator signs for the weight and acceleration attributes.
> *Hint:* Use the method `.str.replace("?", "NaN")` to modify the specific attribute on the string level for each data input. Use the same function for handling the separator issues.
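
The string-level cleaning can be sketched on a small made-up dataframe. The values below are hypothetical and only mimic the kinds of issues described above; `regex=False` is passed explicitly so that characters such as `?` are treated as literal text rather than as a regular expression.

```python
import pandas as pd

# Hypothetical string-valued columns with the same kinds of issues
toy = pd.DataFrame({
    "displacement": ["307", "?", "350"],
    "weight": ["3'504", "2'130", "4'732"],
    "acceleration": ["12,5", "14.0", "11,0"],
})

# Mark "?" as missing, drop the thousand separator, unify the decimal separator
toy["displacement"] = toy["displacement"].str.replace("?", "NaN", regex=False)
toy["weight"] = toy["weight"].str.replace("'", "", regex=False)
toy["acceleration"] = toy["acceleration"].str.replace(",", ".", regex=False)

# The string "NaN" converts to a floating point NaN, so all columns become numerical
toy_numeric = toy.astype(float)
print(toy_numeric)
```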

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In the `README.txt` it is stated that "zeroes in the attributes MPG and displacement can be considered missing values". 

**Task 2.4:** Replace the zeros in these two attributes only with `NaN`, since a zero might be a correct value for some of the other variables.
> *Hint:* You can use `.replace` like before, but do not need to do it on the string level.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

The `README.txt` does not supply a lot of information about what the levels of the "origin" attribute describe, so we either have to make an educated guess based on the values in the context, or preferably obtain the information from any papers that might be referenced in the `README`. From inspection of "origin" and "car names", you should see that (North) American cars are coded as 1, European cars as 2 and Asian cars as 3.

**Task 2.5:** Convert the "origin" attribute to a categorical attribute labeled as described above. Next, apply one-out-of-$K$ encoding to the attribute (as we did with the Iris data) and remove the `carname` attribute.
> *Hint:* Convert the attribute to a categorical Pandas attribute. Then you can use the method `.rename_categories()`.
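
As a sketch of the two steps, the example below renames the levels of a made-up integer-coded attribute and then applies one-out-of-$K$ encoding with `pd.get_dummies()`. The level names are assumptions chosen for illustration, and if the codes in your dataframe are stored as strings, the dictionary keys must be strings as well.

```python
import pandas as pd

# Hypothetical integer-coded attribute with three levels
toy = pd.DataFrame({"origin": [1, 3, 2, 1]})

# Convert to a categorical attribute and give the levels readable names
toy["origin"] = toy["origin"].astype("category")
toy["origin"] = toy["origin"].cat.rename_categories(
    {1: "America", 2: "Europe", 3: "Asia"}
)

# One-out-of-K encoding: one indicator column per category
encoded = pd.get_dummies(toy, columns=["origin"])
print(encoded)
```

Each row of `encoded` contains exactly one active indicator, which is what makes the representation "one-out-of-$K$".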

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

We later find out that a value of 99 for the MPG is not within reason for the cars in this dataset; hence, it is an outlier. The observations that have this MPG value are therefore incorrect, and we should treat the value as missing. So far, the data has been of string type; however, it is easier to define filters if we convert the numerical attributes to numerical types.

**Task 2.6:** Convert the numerical attributes into numerical-valued columns in the Pandas dataframe. Add a line of code to remove the data points (rows) where MPG $= 99$.
> *Hint:* You can index multiple columns of the dataframe using a list of column names, i.e. `messy_data[[name1, name2, ...]]`

> *Hint:* Use `.astype(float)` to convert the string-valued attributes to numerical.

> *Hint:* For filtering out the outlier, you can create a mask like `messy_data.mpg != x` and apply it to the dataframe.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

N, M = messy_data.shape
assert N == 28, "There should be 28 samples in the cleaned messy dataset."

We still have the missing values. In the following we will go through how you could go about handling the missing values before making the $\left(\boldsymbol{X},\boldsymbol{y}\right)$-matrices as above. Various approaches can be used, but it is important to keep in mind never to apply any of them blindly. Keep a record of what you do, and consider/discuss how it might affect your modelling.

The simplest way of handling missing values is to drop any records that contain them. We do this by first determining where the missing values are.

**Task 2.7:** Using Python, determine which observations (rows) contain missing values. Next, remove all observations that hold at least 1 missing value. Store the resulting dataframe in a variable called `clean_data_v1`.
> *Hint:* You can use `.isna().any()` to identify observations with NaN values. What axis should `.any()` be applied on?

> *Hint:* The operator `~` negates the values of a boolean array, i.e. if `A = np.array([True, False])` then `~A = np.array([False, True])`.
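
The masking idea from the hints can be sketched on a small made-up dataframe (the values are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mpg": [18.0, np.nan, 16.0, 24.0],
    "displacement": [307.0, 350.0, np.nan, 113.0],
})

# True for every row with at least one NaN; axis=1 combines the values within a row
has_missing = toy.isna().any(axis=1)
print(has_missing.tolist())

# ~ flips the mask so that only the complete rows are kept
complete_rows = toy[~has_missing]
print(complete_rows)
```

Note that the original row labels are kept after filtering, so the index of `complete_rows` is no longer consecutive.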

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Check the shape of the cleaned data
N, M = clean_data_v1.shape
assert N == 15, "There should be 15 samples in the cleaned messy dataset."
assert M == 10, "There should be 10 features in the cleaned messy dataset after one-out-of-K encoding."

# Display the cleaned dataframe
clean_data_v1

Another approach to handling missing values is to check whether the majority of missing values come from specific attributes. By visual inspection (either from plotting the dataframe using `plt.imshow()` or by checking the dataframe values), we see that the third column, i.e. the displacement attribute, is the major reason we have missing values.

**Task 2.8:** Go back to the `messy_data` dataframe. Remove the displacement attribute (for now) and then remove the few observations (rows) containing missing values. Store the resulting dataframe in a variable `clean_data_v2`.
> *Hint:* Another way to remove rows with missing values is using the Pandas method `.dropna()`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Check the shape of the cleaned data
N, M = clean_data_v2.shape
assert N == 26, "There should be 26 samples in the cleaned messy dataset after removing displacement."
assert M == 9, "There should be 9 features in the cleaned messy dataset after removing displacement."

# Display the cleaned dataframe
clean_data_v2

Lastly, one could impute the missing values - which means to "guess them", in some sense - while trying to minimize the impact of the guess. A simple way of imputing them is to replace the missing values with the median of the attribute. In our specific case, we would have to do this for the missing values in the attributes MPG and displacement.

**Task 2.9:** Go back to the `messy_data` dataframe. Replace missing values in the MPG and displacement columns with their respective median values. Store the resulting dataframe in a variable called `clean_data_v3`.
> *Hint:* For computing the median of an attribute containing NaN-values, use `np.nanmedian` or the Pandas method `.median()`. We will take a closer look at summary statistics next week.

> *Hint:* Check out the Pandas method `.fillna()`.
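
Median imputation for a single attribute can be sketched as follows, using made-up values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mpg": [18.0, np.nan, 16.0, 24.0],
    "weight": [3504.0, 3693.0, 3436.0, 2372.0],
})

# .median() skips NaN values by default, so it is computed over 18, 16 and 24
mpg_median = toy["mpg"].median()

# Work on a copy so that the original dataframe keeps its missing values
imputed = toy.copy()
imputed["mpg"] = imputed["mpg"].fillna(mpg_median)
print(imputed)
```

Working on a copy makes it easy to go back and compare several cleaning strategies on the same starting point.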

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

N, M = clean_data_v3.shape
assert N == 28, "There should be 28 samples in the cleaned messy dataset after filling missing values."
assert M == 10, "There should be 10 features in the cleaned messy dataset after filling missing values."

**Task 2.10:** Which of the methods do you prefer? Which of the cleaned data versions contains most information? Why is this useful in a machine learning context?

Perfect! Now our data is cleaned up and we would actually be able to use it for solving a regression or classification task! Let's try to construct the $\left(\boldsymbol{X}, \boldsymbol{y}\right)$ matrices. One idea could be to try to predict the weight of the cars based on the remaining attributes. This means that we will have to solve a regression problem.

**Task 2.11:** Split the `clean_data_v3` into the $\left(\boldsymbol{X}, \boldsymbol{y}\right)$-format such that the target attribute is "weight".

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Given the data split, we will sometimes need to consider common feature transformations such as standardization and binarization/thresholding. Standardization makes it possible to compare attributes measured on very different scales. It works by subtracting the mean of each attribute and dividing by its standard deviation. Sometimes it is also important to standardize the target attribute itself! Binarization and thresholding, on the other hand, can be used to e.g. construct a discrete classification target attribute from a continuous-valued attribute.

**Task 2.12:** Create a standardized version of the data matrix called `X_standardized`. Do the same for the target attribute and store it in `y_standardized`. Lastly, discretize the weight target attribute into 3 categories being "low", "medium" and "high".
> *Hint:* Use Pandas methods `.mean()` and `.std()` when standardizing $\boldsymbol{X}$ and $\boldsymbol{y}$. What axis should you compute the values over?

> *Hint:* To discretize a continuous attribute with Pandas, use `pd.cut()`. Check the documentation with `help(pd.cut)` for further information.
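
Both transformations can be sketched on a small made-up dataframe. The values and the choice of three equal-width bins are assumptions for illustration; other binning rules (e.g. quantile-based) are equally possible. Note that the Pandas method `.std()` divides by $N-1$, whereas `np.std()` divides by $N$ by default.

```python
import pandas as pd

toy = pd.DataFrame({
    "horsepower": [130.0, 165.0, 150.0, 95.0],
    "weight": [3504.0, 3693.0, 3436.0, 2372.0],
})

# Subtract the column means and divide by the column standard deviations.
# Pandas computes both per column by default and aligns them by column name.
standardized = (toy - toy.mean()) / toy.std()
print(standardized.mean().round(10))
print(standardized.std())

# Split a continuous attribute into 3 equal-width bins with readable labels
weight_class = pd.cut(toy["weight"], bins=3, labels=["low", "medium", "high"])
print(weight_class.tolist())
```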

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Check the standardized features and target variable
assert (abs(X_standardized.mean(axis=0)) < 1e-5).all(), "The mean of the standardized features should be close to 0."
assert (abs(X_standardized.std(axis=0) - 1) < 1e-5).all(), "The standard deviation of the standardized features should be close to 1."
assert abs(y_standardized.mean()) < 1e-5, "The mean of the standardized target variable should be close to 0."
assert abs(y_standardized.std() - 1) < 1e-5, "The standard deviation of the standardized target variable should be close to 1."

**Optional:** Below we show an example of how to use the dataset to fit a supervised learning method. In this scenario we consider a linear regression model. In the coming weeks, we will get to know many more supervised learning techniques. We evaluate the predictions using the root-mean-square error - how can we interpret this value?

In [None]:
from sklearn.linear_model import LinearRegression

# Example of fitting a supervised learning model, e.g. a linear regression model, using sklearn
model = LinearRegression()      # define the model
model.fit(X, y)                 # fit the model to the data
y_hat = model.predict(X)        # predict the target variable using the model on all data

# Compute the RMSE (Root Mean Squared Error)
# this is a common metric for regression tasks as it measures the average magnitude 
# of the errors between predicted and actual values.
RMSE = np.sqrt(np.mean(np.power(y_hat - y, 2)))
print(f"RMSE: {RMSE:.2f}")

---

## Part 3: Text-representation in Python (Optional)

In the previous tasks, we learned how to load and manipulate tabular data as well as some techniques for visualizing the attributes. However, not all data follows the tabular data structure and we will presently consider one such example.

An important area of research in machine learning and data mining is the analysis of text documents. Here, important tasks are to be able to search documents as well as to group related documents together (clustering). In order to accomplish these tasks, the text documents are converted into a format suitable for data modeling. We will use the **bag of words** representation. Here, text documents are stored in a matrix $\boldsymbol{X}$ where $x_{ij}$ indicates how many times word $j$ occurred in document $i$.

Suppose that we have 5 text documents, each containing just a single sentence:

> Document 1: The Google matrix $P$ is a model of the internet.
>
> Document 2: $P_{ij}$ is nonzero if there is a link from webpage $i$ to $j$.
>
> Document 3: The Google matrix is used to rank all Web pages.
>
> Document 4: The ranking is done by solving a matrix eigenvalue problem. 
>
> Document 5: England dropped out of the top 10 in the FIFA ranking. 

**Task 3.1:** Propose a suitable **bag of words** representation for these documents (use pen and paper). You should choose approximately 10 key words in total to define the columns of the document-term matrix, and the words should be chosen such that each document contains at least 2 of your key words, i.e. the document-term matrix should have approximately 10 columns and each row of the matrix must contain at least 2 non-zero entries.
- *Answer:* 

In practice, we can carry out this procedure automatically using the scikit-learn library, or `sklearn`. We will use a function from the feature extraction-module, called `CountVectorizer` to generate a document-term matrix and to convert it into the course format, i.e. $\boldsymbol{X}$. Note, that we will use the words "term" and "token" interchangeably.

**Task 3.2:** Inspect the `textDocs.txt`-file provided in the associated data folder.

As you might have seen, the data is no longer following a fixed structure as in the CSV-files that we previously worked with. Instead of using Pandas for loading the data we will directly read the lines of the `txt`-file. Read and understand the code for loading the documents in the cell below.

In [None]:
# Open the txt-file and read its content
with open('data/BoW/textDocs.txt', 'r') as f:
    raw_file = f.read()

# The raw file is a single string with all content of the file
# We need to split it into individual documents/sentences using \n as the delimiter
corpus = raw_file.split('\n')

# Next, we remove the empty lines from the corpus
corpus = list(filter(None, corpus))

# Display the content of the corpus
print("Corpus (5 documents/sentences):")
print(np.asmatrix(corpus))
print()

**Task 3.3:** Construct the document-term matrix as a `np.array` using `CountVectorizer` and compare the generated document-term matrix to the one you generated yourself.
> *Hint:* Make sure that you have the `sklearn`-package installed.

> *Hint:* Read more about the `CountVectorizer` using `help(CountVectorizer)` after it has been imported from `sklearn`.

In [None]:
# We define a CountVectorizer to convert the corpus into a document-term matrix.
# The token pattern is a regular expression (written as a raw string, marked by
# the r prefix), which ensures that the vectorizer ignores digit/non-word tokens.
# In this case, it ensures that the 10 in the last document is not recognized as
# a token. It is not important that you understand the regular expression itself.
vectorizer = CountVectorizer(token_pattern=r"\b[^\d\W]+\b")

# YOUR CODE HERE
raise NotImplementedError()

# Check the shape of the document-term matrix
N, M = X.shape
assert N == 5, "There should be 5 documents in the corpus."
assert M == 36, "There should be 36 terms in the document-term matrix."

print("Number of documents (data points, N):\t %i" % N)
print("Number of terms (attributes, M):\t %i" % M)
print("\nTerms in the document-term matrix:")
print(terms)
print("\nDocument-term matrix:")
print(X)

Stop words are words that one can find in virtually any document. Therefore, the occurrence of such a word in a document does not distinguish the document from other documents. The following is the beginning of one particular stop word list:
> a, a's, able, about, above, according, accordingly, across, actually, after, afterwards, again, against, ain't, all, allow, allows, almost, alone, along, already, also, although, always, am, among, amongst, an, and, another, any, anybody, anyhow, anyone, anything, anyway, anyways, anywhere, apart, appear, appreciate, appropriate, are, around, as, aside, ask, ...

When forming the document-term matrix it is common to remove these specified stop words.

The generated document-term matrix contains words that carry little information such as the word "the". We will remove these words as they can be interpreted as "noise" carrying no information about the content of the documents. 

**Task 3.4:** Load the stop words from the `stopWords.txt` file in the associated data folder. Compute a new document-term matrix with the stop words removed. How does it compare to your original matrix?

> *Hint:* You can load the stop words similarly to how we loaded the documents. 

> *Hint:* Once the stop words are loaded, they can be passed to the `CountVectorizer` using the keyword `stop_words`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

# Check the shape of the document-term matrix
N, M = X.shape
assert N == 5, "There should be 5 documents in the corpus."
assert M == 19, "There should be 19 terms in the document-term matrix."
print("Number of documents (data points, N):\t %i" % N)
print("Number of terms (attributes, M):\t %i" % M)

# Show the document-term matrix as a Pandas dataframe for better overview
pd.DataFrame(X, columns=terms)

Stemming denotes the process for reducing inflected (or sometimes derived) words to their stem, base or root form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Clearly, from the point of view of information retrieval, no information is lost in the following stemming reduction:
$$
    \left.\begin{array}{l}
        \text{computable}\\
        \text{computing}\\
        \text{computed}\\
        \text{computational}\\
        \text{computation}
    \end{array}\right\}\rightarrow \text{comput}
$$

Documents 3, 4 and 5 have the word "rank" in common. However, in documents 4 and 5 this word is stored as the separate word entry "ranking" in the document-term matrix, whereas in document 3 it is stored as the word entry "rank". As such, the document-term matrix does not indicate that documents 3, 4 and 5 share the word "rank". By using stemming, we can obtain a matrix that indicates that the word "rank" appears in all 3 documents. In the following cell, we will show you how to apply the `PorterStemmer` from the `nltk` package and create a new document-term matrix.
> Make sure you have the `nltk`-package installed.
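
As a small illustration of the stemming idea described above, the sketch below applies `PorterStemmer` to a handful of individual words. The word list is a hypothetical example chosen for illustration and is not taken from the corpus.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical example words: five inflections of "compute" and two of "rank"
words = [
    "computable", "computing", "computed", "computational", "computation",
    "rank", "ranking",
]

# Map every word to its stem
stems = {w: stemmer.stem(w) for w in words}

for word, stem in stems.items():
    print(f"{word:>15} -> {stem}")
```

Words that map to the same stem end up in the same column of the document-term matrix once stemming is part of the analyzer.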

**Task 3.5:** Inspect the document-term matrix after stemming. How does it compare to your original matrix?

In [None]:
# We'll use a widely used stemmer based on:
# Porter, M. “An algorithm for suffix stripping.” Program 14.3 (1980): 130-137.
from nltk.stem import PorterStemmer

# Make an object based on the PorterStemmer class
stemmer = PorterStemmer()
# Construct an analyzer for generating the document-term matrix
analyzer = CountVectorizer(
    token_pattern=r"\b[^\d\W]+\b", 
    stop_words=stopwords
).build_analyzer()

# Using this we make a function that can stem words:
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))

# ... and finally, we create a new vectorizer just like we've done before:
vectorizer_with_stemming = CountVectorizer(analyzer=stemmed_words)

# Fit the vectorizer to the corpus
vectorizer_with_stemming.fit(corpus)
# Extract the terms/tokens from the vectorizer
terms = vectorizer_with_stemming.get_feature_names_out()
# Transform the corpus into the document-term matrix
X = vectorizer_with_stemming.transform(corpus).toarray()

# Check the shape of the document-term matrix
N, M = X.shape
assert N == 5, "There should be 5 documents in the corpus."
assert M == 18, "There should be 18 terms in the document-term matrix."
print("Number of documents (data points, N):\t %i" % N)
print("Number of terms (attributes, M):\t %i" % M)

# Show the document-term matrix as a Pandas dataframe for better overview
pd.DataFrame(X, columns=terms)

Based on our document-term representation, we can now make simple searches (queries) in our documents using some form of similarity measure between a query vector and the rows of the document-term matrix. Let's say we want to find all documents that are relevant to the query "**solving** for the **rank** of a **matrix**". This is represented by a query vector, $\boldsymbol{q}$, constructed in a way analogous to the rows of the document-term matrix, $\boldsymbol{X}$, hence
$$
    \boldsymbol{q} = \left[\begin{array}{cccccccccccccccccc}
      0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0
    \end{array}\right]^\top
$$

We will use the **cosine similarity** as a measure of similarity between the $i$'th document $\boldsymbol{x}_i$ and the query vector $\boldsymbol{q}$, i.e.
$$
  \mathrm{cos}(\boldsymbol{q},\boldsymbol{x}_i)=\frac{\boldsymbol{q}}{\|\boldsymbol{q}\|}\cdot\frac{\boldsymbol{x}_i}{\|\boldsymbol{x}_i\|} =\frac{\boldsymbol{q}^\top\boldsymbol{x}_i}{\|\boldsymbol{q}\|\|\boldsymbol{x}_i\|}
$$
We will learn much more about measures of similarity next week. 
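
The formula can be sketched in `numpy` as follows. The matrix `toy_X` and the query `toy_q` are hypothetical toy values, not the document-term matrix from the exercise, and the sketch assumes that no document is the all-zero vector (otherwise the norm in the denominator is zero).

```python
import numpy as np

# Hypothetical toy document-term matrix (3 documents, 4 terms)
toy_X = np.array([
    [1, 0, 2, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
])
# Hypothetical toy query vector over the same 4 terms
toy_q = np.array([1, 0, 1, 0])

# Numerator: inner product q^T x_i for every document (row) at once
inner_products = toy_X @ toy_q

# Denominator: product of the Euclidean norms ||q|| * ||x_i||
norms = np.linalg.norm(toy_q) * np.linalg.norm(toy_X, axis=1)

toy_similarities = inner_products / norms
print(toy_similarities)
```

A value close to 1 means that the document points in nearly the same direction as the query, while 0 means that they share no terms.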

**Task 3.6:** Compute the cosine similarity between each document and the query using a) pen and paper (i.e. compute the inner products between the relevant vectors) and b) `numpy`. Explain which documents, according to our similarity measure, are most related to the query and verify that Document 4 is the most similar one.
> *Hint:* You can extract a document (row of the `X` matrix) using the command `x=X[i, :]`, where `i` is the index of the document.

> *Hint:* Numpy matrices and arrays can be transposed using the notation `x.T` or `x.transpose()`.

> *Hint:* Dot products between two row vectors can be computed as `np.dot(q, x.T)` (or simply `q @ x.T`).

> *Hint:* The norm of a vector can be computed using the function `np.linalg.norm()`.

In [None]:
# Define the query vector
q = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0])
# Print terms in the query vector
print("Terms in the query vector: ", terms[q == 1])

# Note that you could also get the query vector using the vectorizer:
# q = vectorizer_with_stemming.transform(['matrix rank solv'])
# q = np.asarray(q.toarray())
# print(q)
# or use any other string:
# q = vectorizer_with_stemming.transform(['Can I Google how to fix my problem?'])
# q = np.asarray(q.toarray())

# YOUR CODE HERE
raise NotImplementedError()

# Display the result
print("Query vector:\n {0}\n".format(q))
print("Similarity results:\n {0}".format(cosine_similarities))

**Optional:** If you find text processing exciting, read more about the Natural Language Toolkit (NLTK). Here is a good place to start: http://www.nltk.org/book/